Chapter 5
Machine Learning Basics
Deep learning is a specific kind of machine learning. In order to understand
deep learning well, one must have a solid understanding of the basic principles of
machine learning. This chapter provides a brief course in the most important general
principles that will be applied throughout the rest of the book. Novice readers or
those who want a wider perspective are encouraged to consider machine learning
textbooks with a more comprehensive coverage of the fundamentals, such as Murphy
(2012) or Bishop (2006). If you are already familiar with machine learning basics,
feel free to skip ahead to section 5.11. That section covers some perspectives
on traditional machine learning techniques that have strongly influenced the
development of deep learning algorithms.
We begin with a definition of what a learning algorithm is, and present an
example: the linear regression algorithm. We then proceed to describe how the
challenge of fitting the training data differs from the challenge of finding patterns
that generalize to new data. Most machine learning algorithms have settings
called hyperparameters that must be determined external to the learning algorithm
itself; we discuss how to set these using additional data. Machine learning is
essentially a form of applied statistics with increased emphasis on the use of
computers to statistically estimate complicated functions and a decreased emphasis
on proving confidence intervals around these functions; we therefore present the
two central approaches to statistics: frequentist estimators and Bayesian inference.
Most machine learning algorithms can be divided into the categories of supervised
learning and unsupervised learning; we describe these categories and give some
examples of simple learning algorithms from each category. Most deep learning
algorithms are based on an optimization algorithm called stochastic gradient
descent. We describe how to combine various algorithm components such as
98
CHAPTER 5. MACHINE LEARNING BASICS
an optimization algorithm, a cost function, a model, and a dataset to build a
machine learning algorithm. Finally, in section 5.11, we describe some of the
factors that have limited the ability of traditional machine learning to generalize.
These challenges have motivated the development of deep learning algorithms that
overcome these obstacles.
5.1 Learning Algorithms
A machine learning algorithm is an algorithm that is able to learn from data. But
what do we mean by learning? Mitchell (1997) provides the definition “A computer
program is said to learn from experience
E
with respect to some class of tasks
T
and performance measure
P
, if its performance at tasks in
T
, as measured by
P
,
improves with experience
E
.” One can imagine a very wide variety of experiences
E
, tasks
T
, and performance measures
P
, and we do not make any attempt in this
book to provide a formal definition of what may be used for each of these entities.
Instead, the following sections provide intuitive descriptions and examples of the
different kinds of tasks, performance measures and experiences that can be used
to construct machine learning algorithms.
5.1.1 The Task, T
Machine learning allows us to tackle tasks that are too difficult to solve with
fixed programs written and designed by human beings. From a scientific and
philosophical point of view, machine learning is interesting because developing our
understanding of machine learning entails developing our understanding of the
principles that underlie intelligence.
In this relatively formal definition of the word “task,” the process of learning
itself is not the task. Learning is our means of attaining the ability to perform the
task. For example, if we want a robot to be able to walk, then walking is the task.
We could program the robot to learn to walk, or we could attempt to directly write
a program that specifies how to walk manually.
Machine learning tasks are usually described in terms of how the machine
learning system should process an
example
. An example is a collection of
features
that have been quantitatively measured from some object or event that we want
the machine learning system to process. We typically represent an example as a
vector
x R
n
where each entry
x
i
of the vector is another feature. For example,
the features of an image are usually the values of the pixels in the image.
99
CHAPTER 5. MACHINE LEARNING BASICS
Many kinds of tasks can be solved with machine learning. Some of the most
common machine learning tasks include the following:
Classification
: In this type of task, the computer program is asked to specify
which of
k
categories some input belongs to. To solve this task, the learning
algorithm is usually asked to produce a function
f
:
R
n
{
1
, . . . , k}
. When
y
=
f
(
x
), the model assigns an input described by vector
x
to a category
identified by numeric code
y
. There are other variants of the classification
task, for example, where
f
outputs a probability distribution over classes.
An example of a classification task is object recognition, where the input
is an image (usually described as a set of pixel brightness values), and the
output is a numeric code identifying the object in the image. For example,
the Willow Garage PR2 robot is able to act as a waiter that can recognize
different kinds of drinks and deliver them to people on command (Good-
fellow et al., 2010). Modern object recognition is best accomplished with
deep learning (Krizhevsky et al., 2012; Ioffe and Szegedy, 2015). Object
recognition is the same basic technology that allows computers to recognize
faces (Taigman et al., 2014), which can be used to automatically tag people
in photo collections and allow computers to interact more naturally with
their users.
Classification with missing inputs
: Classification becomes more chal-
lenging if the computer program is not guaranteed that every measurement
in its input vector will always be provided. In order to solve the classification
task, the learning algorithm only has to define a single function mapping
from a vector input to a categorical output. When some of the inputs may
be missing, rather than providing a single classification function, the learning
algorithm must learn a set of functions. Each function corresponds to classi-
fying
x
with a different subset of its inputs missing. This kind of situation
arises frequently in medical diagnosis, because many kinds of medical tests
are expensive or invasive. One way to efficiently define such a large set
of functions is to learn a probability distribution over all of the relevant
variables, then solve the classification task by marginalizing out the missing
variables. With
n
input variables, we can now obtain all 2
n
different classifi-
cation functions needed for each possible set of missing inputs, but we only
need to learn a single function describing the joint probability distribution.
See Goodfellow et al. (2013b) for an example of a deep probabilistic model
applied to such a task in this way. Many of the other tasks described in this
section can also be generalized to work with missing inputs; classification
with missing inputs is just one example of what machine learning can do.
100
CHAPTER 5. MACHINE LEARNING BASICS
Regression
: In this type of task, the computer program is asked to predict a
numerical value given some input. To solve this task, the learning algorithm
is asked to output a function
f
:
R
n
R
. This type of task is similar to
classification, except that the format of output is different. An example of
a regression task is the prediction of the expected claim amount that an
insured person will make (used to set insurance premiums), or the prediction
of future prices of securities. These kinds of predictions are also used for
algorithmic trading.
Transcription
: In this type of task, the machine learning system is asked
to observe a relatively unstructured representation of some kind of data and
transcribe it into discrete, textual form. For example, in optical character
recognition, the computer program is shown a photograph containing an
image of text and is asked to return this text in the form of a sequence
of characters (e.g., in ASCII or Unicode format). Google Street View uses
deep learning to process address numbers in this way (Goodfellow et al.,
2014d). Another example is speech recognition, where the computer program
is provided an audio waveform and emits a sequence of characters or word
ID codes describing the words that were spoken in the audio recording. Deep
learning is a crucial component of modern speech recognition systems used
at major companies including Microsoft, IBM and Google (Hinton et al.,
2012b).
Machine translation
: In a machine translation task, the input already
consists of a sequence of symbols in some language, and the computer program
must convert this into a sequence of symbols in another language. This is
commonly applied to natural languages, such as translating from English to
French. Deep learning has recently begun to have an important impact on
this kind of task (Sutskever et al., 2014; Bahdanau et al., 2015).
Structured output
: Structured output tasks involve any task where the
output is a vector (or other data structure containing multiple values) with
important relationships between the different elements. This is a broad
category, and subsumes the transcription and translation tasks described
above, but also many other tasks. One example is parsing—mapping a
natural language sentence into a tree that describes its grammatical structure
and tagging nodes of the trees as being verbs, nouns, or adverbs, and so on.
See Collobert (2011) for an example of deep learning applied to a parsing
task. Another example is pixel-wise segmentation of images, where the
computer program assigns every pixel in an image to a specific category. For
101
CHAPTER 5. MACHINE LEARNING BASICS
example, deep learning can be used to annotate the locations of roads in
aerial photographs (Mnih and Hinton, 2010). The output need not have its
form mirror the structure of the input as closely as in these annotation-style
tasks. For example, in image captioning, the computer program observes an
image and outputs a natural language sentence describing the image (Kiros
et al., 2014a,b; Mao et al., 2015; Vinyals et al., 2015b; Donahue et al., 2014;
Karpathy and Li, 2015; Fang et al., 2015; Xu et al., 2015). These tasks are
called structured output tasks because the program must output several
values that are all tightly inter-related. For example, the words produced by
an image captioning program must form a valid sentence.
Anomaly detection
: In this type of task, the computer program sifts
through a set of events or objects, and flags some of them as being unusual
or atypical. An example of an anomaly detection task is credit card fraud
detection. By modeling your purchasing habits, a credit card company can
detect misuse of your cards. If a thief steals your credit card or credit card
information, the thief’s purchases will often come from a different probability
distribution over purchase types than your own. The credit card company
can prevent fraud by placing a hold on an account as soon as that card has
been used for an uncharacteristic purchase. See Chandola et al. (2009) for a
survey of anomaly detection methods.
Synthesis and sampling
: In this type of task, the machine learning al-
gorithm is asked to generate new examples that are similar to those in the
training data. Synthesis and sampling via machine learning can be useful
for media applications where it can be expensive or boring for an artist to
generate large volumes of content by hand. For example, video games can
automatically generate textures for large objects or landscapes, rather than
requiring an artist to manually label each pixel (Luo et al., 2013). In some
cases, we want the sampling or synthesis procedure to generate some specific
kind of output given the input. For example, in a speech synthesis task, we
provide a written sentence and ask the program to emit an audio waveform
containing a spoken version of that sentence. This is a kind of structured
output task, but with the added qualification that there is no single correct
output for each input, and we explicitly desire a large amount of variation in
the output, in order for the output to seem more natural and realistic.
Imputation of missing values
: In this type of task, the machine learning
algorithm is given a new example
x R
n
, but with some entries
x
i
of
x
missing. The algorithm must provide a prediction of the values of the missing
entries.
102
CHAPTER 5. MACHINE LEARNING BASICS
Denoising
: In this type of task, the machine learning algorithm is given in
input a corrupted example
˜
x R
n
obtained by an unknown corruption process
from a clean example
x R
n
. The learner must predict the clean example
x
from its corrupted version
˜
x
, or more generally predict the conditional
probability distribution p(x |
˜
x).
Density estimation
or
probability mass function estimation
: In
the density estimation problem, the machine learning algorithm is asked
to learn a function
p
model
:
R
n
R
, where
p
model
(
x
) can be interpreted
as a probability density function (if
x
is continuous) or a probability mass
function (if
x
is discrete) on the space that the examples were drawn from.
To do such a task well (we will specify exactly what that means when we
discuss performance measures
P
), the algorithm needs to learn the structure
of the data it has seen. It must know where examples cluster tightly and
where they are unlikely to occur. Most of the tasks described above require
the learning algorithm to at least implicitly capture the structure of the
probability distribution. Density estimation allows us to explicitly capture
that distribution. In principle, we can then perform computations on that
distribution in order to solve the other tasks as well. For example, if we
have performed density estimation to obtain a probability distribution
p
(
x
),
we can use that distribution to solve the missing value imputation task. If
a value
x
i
is missing and all of the other values, denoted
x
i
, are given,
then we know the distribution over it is given by
p
(
x
i
| x
i
). In practice,
density estimation does not always allow us to solve all of these related tasks,
because in many cases the required operations on
p
(
x
) are computationally
intractable.
Of course, many other tasks and types of tasks are possible. The types of tasks
we list here are intended only to provide examples of what machine learning can
do, not to define a rigid taxonomy of tasks.
5.1.2 The Performance Measure, P
In order to evaluate the abilities of a machine learning algorithm, we must design
a quantitative measure of its performance. Usually this performance measure
P
is
specific to the task T being carried out by the system.
For tasks such as classification, classification with missing inputs, and tran-
scription, we often measure the
accuracy
of the model. Accuracy is just the
proportion of examples for which the model produces the correct output. We can
103
CHAPTER 5. MACHINE LEARNING BASICS
also obtain equivalent information by measuring the
error rate
, the proportion
of examples for which the model produces an incorrect output. We often refer to
the error rate as the expected 0-1 loss. The 0-1 loss on a particular example is 0
if it is correctly classified and 1 if it is not. For tasks such as density estimation,
it does not make sense to measure accuracy, error rate, or any other kind of 0-1
loss. Instead, we must use a different performance metric that gives the model
a continuous-valued score for each example. The most common approach is to
report the average log-probability the model assigns to some examples.
Usually we are interested in how well the machine learning algorithm performs
on data that it has not seen before, since this determines how well it will work when
deployed in the real world. We therefore evaluate these performance measures using
a
test set
of data that is separate from the data used for training the machine
learning system.
The choice of performance measure may seem straightforward and objective,
but it is often difficult to choose a performance measure that corresponds well to
the desired behavior of the system.
In some cases, this is because it is difficult to decide what should be measured.
For example, when performing a transcription task, should we measure the accuracy
of the system at transcribing entire sequences, or should we use a more fine-grained
performance measure that gives partial credit for getting some elements of the
sequence correct? When performing a regression task, should we penalize the
system more if it frequently makes medium-sized mistakes or if it rarely makes
very large mistakes? These kinds of design choices depend on the application.
In other cases, we know what quantity we would ideally like to measure, but
measuring it is impractical. For example, this arises frequently in the context of
density estimation. Many of the best probabilistic models represent probability
distributions only implicitly. Computing the actual probability value assigned to
a specific point in space in many such models is intractable. In these cases, one
must design an alternative criterion that still corresponds to the design objectives,
or design a good approximation to the desired criterion.
5.1.3 The Experience, E
Machine learning algorithms can be broadly categorized as
unsupervised
or
supervised
by what kind of experience they are allowed to have during the
learning process.
Most of the learning algorithms in this book can be understood as being allowed
to experience an entire
dataset
. A dataset is a collection of many examples, as
104
CHAPTER 5. MACHINE LEARNING BASICS
defined in section 5.1.1. Sometimes we will also call examples data points.
One of the oldest datasets studied by statisticians and machine learning re-
searchers is the Iris dataset (Fisher, 1936). It is a collection of measurements of
different parts of 150 iris plants. Each individual plant corresponds to one example.
The features within each example are the measurements of each of the parts of the
plant: the sepal length, sepal width, petal length and petal width. The dataset
also records which species each plant belonged to. Three different species are
represented in the dataset.
Unsupervised learning algorithms
experience a dataset containing many
features, then learn useful properties of the structure of this dataset. In the context
of deep learning, we usually want to learn the entire probability distribution that
generated a dataset, whether explicitly as in density estimation or implicitly for
tasks like synthesis or denoising. Some other unsupervised learning algorithms
perform other roles, like clustering, which consists of dividing the dataset into
clusters of similar examples.
Supervised learning algorithms
experience a dataset containing features,
but each example is also associated with a
label
or
target
. For example, the Iris
dataset is annotated with the species of each iris plant. A supervised learning
algorithm can study the Iris dataset and learn to classify iris plants into three
different species based on their measurements.
Roughly speaking, unsupervised learning involves observing several examples
of a random vector
x
, and attempting to implicitly or explicitly learn the proba-
bility distribution
p
(
x
), or some interesting properties of that distribution, while
supervised learning involves observing several examples of a random vector
x
and
an associated value or vector
y
, and learning to predict
y
from
x
, usually by
estimating
p
(
y | x
). The term
supervised learning
originates from the view of
the target
y
being provided by an instructor or teacher who shows the machine
learning system what to do. In unsupervised learning, there is no instructor or
teacher, and the algorithm must learn to make sense of the data without this guide.
Unsupervised learning and supervised learning are not formally defined terms.
The lines between them are often blurred. Many machine learning technologies can
be used to perform both tasks. For example, the chain rule of probability states
that for a vector x R
n
, the joint distribution can be decomposed as
p(x) =
n
i=1
p(x
i
| x
1
, . . . , x
i1
). (5.1)
This decomposition means that we can solve the ostensibly unsupervised problem of
modeling
p
(
x
) by splitting it into
n
supervised learning problems. Alternatively, we
105
CHAPTER 5. MACHINE LEARNING BASICS
can solve the supervised learning problem of learning
p
(
y | x
) by using traditional
unsupervised learning technologies to learn the joint distribution
p
(
x, y
) and
inferring
p(y | x) =
p(x, y)
y
p(x, y
)
. (5.2)
Though unsupervised learning and supervised learning are not completely formal or
distinct concepts, they do help to roughly categorize some of the things we do with
machine learning algorithms. Traditionally, people refer to regression, classification
and structured output problems as supervised learning. Density estimation in
support of other tasks is usually considered unsupervised learning.
Other variants of the learning paradigm are possible. For example, in semi-
supervised learning, some examples include a supervision target but others do
not. In multi-instance learning, an entire collection of examples is labeled as
containing or not containing an example of a class, but the individual members
of the collection are not labeled. For a recent example of multi-instance learning
with deep models, see Kotzias et al. (2015).
Some machine learning algorithms do not just experience a fixed dataset. For
example,
reinforcement learning
algorithms interact with an environment, so
there is a feedback loop between the learning system and its experiences. Such
algorithms are beyond the scope of this book. Please see Sutton and Barto (1998)
or Bertsekas and Tsitsiklis (1996) for information about reinforcement learning,
and Mnih et al. (2013) for the deep learning approach to reinforcement learning.
Most machine learning algorithms simply experience a dataset. A dataset can
be described in many ways. In all cases, a dataset is a collection of examples,
which are in turn collections of features.
One common way of describing a dataset is with a design matrix. A design
matrix is a matrix containing a different example in each row. Each column of the
matrix corresponds to a different feature. For instance, the Iris dataset contains
150 examples with four features for each example. This means we can represent
the dataset with a design matrix
X R
150×4
, where
X
i,1
is the sepal length of
plant
i
,
X
i,2
is the sepal width of plant
i
, etc. We will describe most of the learning
algorithms in this book in terms of how they operate on design matrix datasets.
Of course, to describe a dataset as a design matrix, it must be possible to
describe each example as a vector, and each of these vectors must be the same size.
This is not always possible. For example, if you have a collection of photographs
with different widths and heights, then different photographs will contain different
numbers of pixels, so not all of the photographs may be described with the same
length of vector. Section 9.7 and chapter 10 describe how to handle different
106
CHAPTER 5. MACHINE LEARNING BASICS
types of such heterogeneous data. In cases like these, rather than describing the
dataset as a matrix with
m
rows, we will describe it as a set containing
m
elements:
{x
(1)
, x
(2)
, . . . , x
(m)
}
. This notation does not imply that any two example vectors
x
(i)
and x
(j)
have the same size.
In the case of supervised learning, the example contains a label or target as
well as a collection of features. For example, if we want to use a learning algorithm
to perform object recognition from photographs, we need to specify which object
appears in each of the photos. We might do this with a numeric code, with 0
signifying a person, 1 signifying a car, 2 signifying a cat, etc. Often when working
with a dataset containing a design matrix of feature observations
X
, we also
provide a vector of labels y, with y
i
providing the label for example i.
Of course, sometimes the label may be more than just a single number. For
example, if we want to train a speech recognition system to transcribe entire
sentences, then the label for each example sentence is a sequence of words.
Just as there is no formal definition of supervised and unsupervised learning,
there is no rigid taxonomy of datasets or experiences. The structures described here
cover most cases, but it is always possible to design new ones for new applications.
5.1.4 Example: Linear Regression
Our definition of a machine learning algorithm as an algorithm that is capable
of improving a computer program’s performance at some task via experience is
somewhat abstract. To make this more concrete, we present an example of a
simple machine learning algorithm:
linear regression
. We will return to this
example repeatedly as we introduce more machine learning concepts that help to
understand its behavior.
As the name implies, linear regression solves a regression problem. In other
words, the goal is to build a system that can take a vector
x R
n
as input and
predict the value of a scalar
y R
as its output. In the case of linear regression,
the output is a linear function of the input. Let
ˆy
be the value that our model
predicts y should take on. We define the output to be
ˆy = w
x (5.3)
where w R
n
is a vector of parameters.
Parameters are values that control the behavior of the system. In this case,
w
i
is
the coefficient that we multiply by feature
x
i
before summing up the contributions
from all the features. We can think of
w
as a set of
weights
that determine how
each feature affects the prediction. If a feature
x
i
receives a positive weight
w
i
,
107
CHAPTER 5. MACHINE LEARNING BASICS
then increasing the value of that feature increases the value of our prediction
ˆy
.
If a feature receives a negative weight, then increasing the value of that feature
decreases the value of our prediction. If a feature’s weight is large in magnitude,
then it has a large effect on the prediction. If a feature’s weight is zero, it has no
effect on the prediction.
We thus have a definition of our task
T
: to predict
y
from
x
by outputting
ˆy = w
x. Next we need a definition of our performance measure, P .
Suppose that we have a design matrix of
m
example inputs that we will not
use for training, only for evaluating how well the model performs. We also have
a vector of regression targets providing the correct value of
y
for each of these
examples. Because this dataset will only be used for evaluation, we call it the
test
set
. We refer to the design matrix of inputs as
X
(test)
and the vector of regression
targets as y
(test)
.
One way of measuring the performance of the model is to compute the
mean
squared error
of the model on the test set. If
ˆ
y
(test)
gives the predictions of the
model on the test set, then the mean squared error is given by
MSE
test
=
1
m
i
(
ˆ
y
(test)
y
(test)
)
2
i
. (5.4)
Intuitively, one can see that this error measure decreases to 0 when
ˆ
y
(test)
=
y
(test)
.
We can also see that
MSE
test
=
1
m
||
ˆ
y
(test)
y
(test)
||
2
2
, (5.5)
so the error increases whenever the Euclidean distance between the predictions
and the targets increases.
To make a machine learning algorithm, we need to design an algorithm that
will improve the weights
w
in a way that reduces
MSE
test
when the algorithm
is allowed to gain experience by observing a training set (
X
(train)
, y
(train)
). One
intuitive way of doing this (which we will justify later, in section 5.5.1) is just to
minimize the mean squared error on the training set, MSE
train
.
To minimize MSE
train
, we can simply solve for where its gradient is 0:
w
MSE
train
= 0 (5.6)
w
1
m
||
ˆ
y
(train)
y
(train)
||
2
2
= 0 (5.7)
1
m
w
||X
(train)
w y
(train)
||
2
2
= 0 (5.8)
108
CHAPTER 5. MACHINE LEARNING BASICS
1.0 0.5 0.0 0.5 1.0
x
1
3
2
1
0
1
2
3
y
Linear regression example
0.5 1.0 1.5
w
1
0.20
0.25
0.30
0.35
0.40
0.45
0.50
0.55
MSE
(train)
Optimization of w
Figure 5.1: A linear regression problem, with a training set consisting of ten data points,
each containing one feature. Because there is only one feature, the weight vector
w
contains only a single parameter to learn,
w
1
. (Left)Observe that linear regression learns
to set
w
1
such that the line
y
=
w
1
x
comes as close as possible to passing through all the
training points. (Right)The plotted point indicates the value of
w
1
found by the normal
equations, which we can see minimizes the mean squared error on the training set.
w
X
(train)
w y
(train)
X
(train)
w y
(train)
= 0 (5.9)
w
w
X
(train)
X
(train)
w 2w
X
(train)
y
(train)
+ y
(train)
y
(train)
= 0
(5.10)
2X
(train)
X
(train)
w 2X
(train)
y
(train)
= 0 (5.11)
w =
X
(train)
X
(train)
1
X
(train)
y
(train)
(5.12)
The system of equations whose solution is given by equation 5.12 is known as
the
normal equations
. Evaluating equation 5.12 constitutes a simple learning
algorithm. For an example of the linear regression learning algorithm in action,
see figure 5.1.
It is worth noting that the term
linear regression
is often used to refer to
a slightly more sophisticated model with one additional parameter—an intercept
term b. In this model
ˆy = w
x + b (5.13)
so the mapping from parameters to predictions is still a linear function but the
mapping from features to predictions is now an affine function. This extension to
affine functions means that the plot of the model’s predictions still looks like a
line, but it need not pass through the origin. Instead of adding the bias parameter
109
CHAPTER 5. MACHINE LEARNING BASICS
b
, one can continue to use the model with only weights but augment
x
with an
extra entry that is always set to 1. The weight corresponding to the extra 1 entry
plays the role of the bias parameter. We will frequently use the term “linear” when
referring to affine functions throughout this book.
The intercept term
b
is often called the
bias
parameter of the affine transfor-
mation. This terminology derives from the point of view that the output of the
transformation is biased toward being
b
in the absence of any input. This term
is different from the idea of a statistical bias, in which a statistical estimation
algorithm’s expected estimate of a quantity is not equal to the true quantity.
Linear regression is of course an extremely simple and limited learning algorithm,
but it provides an example of how a learning algorithm can work. In the subsequent
sections we will describe some of the basic principles underlying learning algorithm
design and demonstrate how these principles can be used to build more complicated
learning algorithms.
5.2 Capacity, Overfitting and Underfitting
The central challenge in machine learning is that we must perform well on new,
previously unseen inputs—not just those on which our model was trained. The
ability to perform well on previously unobserved inputs is called
generalization
.
Typically, when training a machine learning model, we have access to a training
set, we can compute some error measure on the training set called the
training
error
, and we reduce this training error. So far, what we have described is simply
an optimization problem. What separates machine learning from optimization is
that we want the
generalization error
, also called the
test error
, to be low as
well. The generalization error is defined as the expected value of the error on a
new input. Here the expectation is taken across different possible inputs, drawn
from the distribution of inputs we expect the system to encounter in practice.
We typically estimate the generalization error of a machine learning model by
measuring its performance on a
test set
of examples that were collected separately
from the training set.
In our linear regression example, we trained the model by minimizing the
training error,
1
m
(train)
||X
(train)
w y
(train)
||
2
2
, (5.14)
but we actually care about the test error,
1
m
(test)
||X
(test)
w y
(test)
||
2
2
.
How can we affect performance on the test set when we get to observe only the
110
CHAPTER 5. MACHINE LEARNING BASICS
training set? The field of
statistical learning theory
provides some answers. If
the training and the test set are collected arbitrarily, there is indeed little we can
do. If we are allowed to make some assumptions about how the training and test
set are collected, then we can make some progress.
The train and test data are generated by a probability distribution over datasets
called the
data generating process
. We typically make a set of assumptions
known collectively as the
i.i.d. assumptions
. These assumptions are that the
examples in each dataset are
independent
from each other, and that the train
set and test set are
identically distributed
, drawn from the same probability
distribution as each other. This assumption allows us to describe the data gen-
erating process with a probability distribution over a single example. The same
distribution is then used to generate every train example and every test example.
We call that shared underlying distribution the
data generating distribution
,
denoted
p
data
. This probabilistic framework and the i.i.d. assumptions allow us to
mathematically study the relationship between training error and test error.
One immediate connection we can observe between the training and test error
is that the expected training error of a randomly selected model is equal to the
expected test error of that model. Suppose we have a probability distribution
p
(
x, y
) and we sample from it repeatedly to generate the train set and the test
set. For some fixed value
w
, the expected training set error is exactly the same as
the expected test set error, because both expectations are formed using the same
dataset sampling process. The only difference between the two conditions is the
name we assign to the dataset we sample.
Of course, when we use a machine learning algorithm, we do not fix the
parameters ahead of time, then sample both datasets. We sample the training set,
then use it to choose the parameters to reduce training set error, then sample the
test set. Under this process, the expected test error is greater than or equal to
the expected value of training error. The factors determining how well a machine
learning algorithm will perform are its ability to:
1. Make the training error small.
2. Make the gap between training and test error small.
These two factors correspond to the two central challenges in machine learning:
underfitting
and
overfitting
. Underfitting occurs when the model is not able to
obtain a sufficiently low error value on the training set. Overfitting occurs when
the gap between the training error and test error is too large.
We can control whether a model is more likely to overfit or underfit by altering
its
capacity
. Informally, a model’s capacity is its ability to fit a wide variety of
111
CHAPTER 5. MACHINE LEARNING BASICS
functions. Models with low capacity may struggle to fit the training set. Models
with high capacity can overfit by memorizing properties of the training set that do
not serve them well on the test set.
One way to control the capacity of a learning algorithm is by choosing its
hypothesis space, the set of functions that the learning algorithm is allowed to
select as being the solution. For example, the linear regression algorithm has the
set of all linear functions of its input as its hypothesis space. We can generalize
linear regression to include polynomials, rather than just linear functions, in its
hypothesis space. Doing so increases the model’s capacity.
A polynomial of degree one gives us the linear regression model with which we
are already familiar, with prediction
ˆy = b + wx. (5.15)
By introducing
x
2
as another feature provided to the linear regression model, we
can learn a model that is quadratic as a function of x:
ˆy = b + w
1
x + w
2
x
2
. (5.16)
Though this model implements a quadratic function of its input, the output is
still a linear function of the parameters, so we can still use the normal equations
to train the model in closed form. We can continue to add more powers of
x
as
additional features, for example to obtain a polynomial of degree 9:
ˆy = b +
9
i=1
w
i
x
i
. (5.17)
Machine learning algorithms will generally perform best when their capacity
is appropriate for the true complexity of the task they need to perform and the
amount of training data they are provided with. Models with insufficient capacity
are unable to solve complex tasks. Models with high capacity can solve complex
tasks, but when their capacity is higher than needed to solve the present task they
may overfit.
Figure 5.2 shows this principle in action. We compare a linear, quadratic
and degree-9 predictor attempting to fit a problem where the true underlying
function is quadratic. The linear function is unable to capture the curvature in
the true underlying problem, so it underfits. The degree-9 predictor is capable of
representing the correct function, but it is also capable of representing infinitely
many other functions that pass exactly through the training points, because we
112
CHAPTER 5. MACHINE LEARNING BASICS
have more parameters than training examples. We have little chance of choosing
a solution that generalizes well when so many wildly different solutions exist. In
this example, the quadratic model is perfectly matched to the true structure of
the task so it generalizes well to new data.



Figure 5.2: We fit three models to this example training set. The training data was
generated synthetically, by randomly sampling
x
values and choosing
y
deterministically
by evaluating a quadratic function. (Left)A linear function fit to the data suffers from
underfitting—it cannot capture the curvature that is present in the data. (Center)A
quadratic function fit to the data generalizes well to unseen points. It does not suffer from
a significant amount of overfitting or underfitting. (Right)A polynomial of degree 9 fit to
the data suffers from overfitting. Here we used the Moore-Penrose pseudoinverse to solve
the underdetermined normal equations. The solution passes through all of the training
points exactly, but we have not been lucky enough for it to extract the correct structure.
It now has a deep valley in between two training points that does not appear in the true
underlying function. It also increases sharply on the left side of the data, while the true
function decreases in this area.
So far we have described only one way of changing a model’s capacity: by
changing the number of input features it has, and simultaneously adding new
parameters associated with those features. There are in fact many ways of changing
a model’s capacity. Capacity is not determined only by the choice of model. The
model specifies which family of functions the learning algorithm can choose from
when varying the parameters in order to reduce a training objective. This is called
the
representational capacity
of the model. In many cases, finding the best
function within this family is a very difficult optimization problem. In practice,
the learning algorithm does not actually find the best function, but merely one
that significantly reduces the training error. These additional limitations, such as
113
CHAPTER 5. MACHINE LEARNING BASICS
the imperfection of the optimization algorithm, mean that the learning algorithm’s
effective capacity
may be less than the representational capacity of the model
family.
Our modern ideas about improving the generalization of machine learning
models are refinements of thought dating back to philosophers at least as early
as Ptolemy. Many early scholars invoke a principle of parsimony that is now
most widely known as
Occam’s razor
(c. 1287-1347). This principle states that
among competing hypotheses that explain known observations equally well, one
should choose the “simplest” one. This idea was formalized and made more precise
in the 20th century by the founders of statistical learning theory (Vapnik and
Chervonenkis, 1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995).
Statistical learning theory provides various means of quantifying model capacity.
Among these, the most well-known is the
Vapnik-Chervonenkis dimension
, or
VC dimension. The VC dimension measures the capacity of a binary classifier. The
VC dimension is defined as being the largest possible value of
m
for which there
exists a training set of
m
different
x
points that the classifier can label arbitrarily.
Quantifying the capacity of the model allows statistical learning theory to
make quantitative predictions. The most important results in statistical learning
theory show that the discrepancy between training error and generalization error
is bounded from above by a quantity that grows as the model capacity grows but
shrinks as the number of training examples increases (Vapnik and Chervonenkis,
1971; Vapnik, 1982; Blumer et al., 1989; Vapnik, 1995). These bounds provide
intellectual justification that machine learning algorithms can work, but they are
rarely used in practice when working with deep learning algorithms. This is in
part because the bounds are often quite loose and in part because it can be quite
difficult to determine the capacity of deep learning algorithms. The problem of
determining the capacity of a deep learning model is especially difficult because the
effective capacity is limited by the capabilities of the optimization algorithm, and
we have little theoretical understanding of the very general non-convex optimization
problems involved in deep learning.
We must remember that while simpler functions are more likely to generalize
(to have a small gap between training and test error) we must still choose a
sufficiently complex hypothesis to achieve low training error. Typically, training
error decreases until it asymptotes to the minimum possible error value as model
capacity increases (assuming the error measure has a minimum value). Typically,
generalization error has a U-shaped curve as a function of model capacity. This is
illustrated in figure 5.3.
To reach the most extreme case of arbitrarily high capacity, we introduce
114
CHAPTER 5. MACHINE LEARNING BASICS
0 Optimal Capacity
Capacity
Error
Underfitting zone Overfitting zone
Generalization gap
Training error
Generalization error
Figure 5.3: Typical relationship between capacity and error. Training and test error
behave differently. At the left end of the graph, training error and generalization error
are both high. This is the
underfitting regime
. As we increase capacity, training error
decreases, but the gap between training and generalization error increases. Eventually,
the size of this gap outweighs the decrease in training error, and we enter the
overfitting
regime, where capacity is too large, above the optimal capacity.
the concept of
non-parametric
models. So far, we have seen only parametric
models, such as linear regression. Parametric models learn a function described
by a parameter vector whose size is finite and fixed before any data is observed.
Non-parametric models have no such limitation.
Sometimes, non-parametric models are just theoretical abstractions (such as
an algorithm that searches over all possible probability distributions) that cannot
be implemented in practice. However, we can also design practical non-parametric
models by making their complexity a function of the training set size. One example
of such an algorithm is
nearest neighbor regression
. Unlike linear regression,
which has a fixed-length vector of weights, the nearest neighbor regression model
simply stores the
X
and
y
from the training set. When asked to classify a test
point
x
, the model looks up the nearest entry in the training set and returns the
associated regression target. In other words,
ˆy
=
y
i
where
i
=
arg min ||X
i,:
x||
2
2
.
The algorithm can also be generalized to distance metrics other than the
L
2
norm,
such as learned distance metrics (Goldberger et al., 2005). If the algorithm is
allowed to break ties by averaging the
y
i
values for all
X
i,:
that are tied for nearest,
then this algorithm is able to achieve the minimum possible training error (which
might be greater than zero, if two identical inputs are associated with different
outputs) on any regression dataset.
Finally, we can also create a non-parametric learning algorithm by wrapping a
115
CHAPTER 5. MACHINE LEARNING BASICS
parametric learning algorithm inside another algorithm that increases the number
of parameters as needed. For example, we could imagine an outer loop of learning
that changes the degree of the polynomial learned by linear regression on top of a
polynomial expansion of the input.
The ideal model is an oracle that simply knows the true probability distribution
that generates the data. Even such a model will still incur some error on many
problems, because there may still be some noise in the distribution. In the case
of supervised learning, the mapping from
x
to
y
may be inherently stochastic,
or
y
may be a deterministic function that involves other variables besides those
included in
x
. The error incurred by an oracle making predictions from the true
distribution p(x, y) is called the Bayes error.
Training and generalization error vary as the size of the training set varies.
Expected generalization error can never increase as the number of training examples
increases. For non-parametric models, more data yields better generalization until
the best possible error is achieved. Any fixed parametric model with less than
optimal capacity will asymptote to an error value that exceeds the Bayes error. See
figure 5.4 for an illustration. Note that it is possible for the model to have optimal
capacity and yet still have a large gap between training and generalization error.
In this situation, we may be able to reduce this gap by gathering more training
examples.
5.2.1 The No Free Lunch Theorem
Learning theory claims that a machine learning algorithm can generalize well from
a finite training set of examples. This seems to contradict some basic principles of
logic. Inductive reasoning, or inferring general rules from a limited set of examples,
is not logically valid. To logically infer a rule describing every member of a set,
one must have information about every member of that set.
In part, machine learning avoids this problem by offering only probabilistic rules,
rather than the entirely certain rules used in purely logical reasoning. Machine
learning promises to find rules that are probably correct about most members of
the set they concern.
Unfortunately, even this does not resolve the entire problem. The
no free
lunch theorem
for machine learning (Wolpert, 1996) states that, averaged over
all possible data generating distributions, every classification algorithm has the
same error rate when classifying previously unobserved points. In other words,
in some sense, no machine learning algorithm is universally any better than any
other. The most sophisticated algorithm we can conceive of has the same average
116
CHAPTER 5. MACHINE LEARNING BASICS
































Figure 5.4: The effect of the training dataset size on the train and test error, as well as
on the optimal model capacity. We constructed a synthetic regression problem based on
adding a moderate amount of noise to a degree-5 polynomial, generated a single test set,
and then generated several different sizes of training set. For each size, we generated 40
different training sets in order to plot error bars showing 95 percent confidence intervals.
(Top)The MSE on the training and test set for two different models: a quadratic model,
and a model with degree chosen to minimize the test error. Both are fit in closed form. For
the quadratic model, the training error increases as the size of the training set increases.
This is because larger datasets are harder to fit. Simultaneously, the test error decreases,
because fewer incorrect hypotheses are consistent with the training data. The quadratic
model does not have enough capacity to solve the task, so its test error asymptotes to
a high value. The test error at optimal capacity asymptotes to the Bayes error. The
training error can fall below the Bayes error, due to the ability of the training algorithm
to memorize specific instances of the training set. As the training size increases to infinity,
the training error of any fixed-capacity model (here, the quadratic model) must rise to at
least the Bayes error. (Bottom)As the training set size increases, the optimal capacity
(shown here as the degree of the optimal polynomial regressor) increases. The optimal
capacity plateaus after reaching sufficient complexity to solve the task.
117
CHAPTER 5. MACHINE LEARNING BASICS
performance (over all possible tasks) as merely predicting that every point belongs
to the same class.
Fortunately, these results hold only when we average over all possible data
generating distributions. If we make assumptions about the kinds of probability
distributions we encounter in real-world applications, then we can design learning
algorithms that perform well on these distributions.
This means that the goal of machine learning research is not to seek a universal
learning algorithm or the absolute best learning algorithm. Instead, our goal is to
understand what kinds of distributions are relevant to the “real world” that an AI
agent experiences, and what kinds of machine learning algorithms perform well on
data drawn from the kinds of data generating distributions we care about.
5.2.2 Regularization
The no free lunch theorem implies that we must design our machine learning
algorithms to perform well on a specific task. We do so by building a set of
preferences into the learning algorithm. When these preferences are aligned with
the learning problems we ask the algorithm to solve, it performs better.
So far, the only method of modifying a learning algorithm that we have discussed
concretely is to increase or decrease the model’s representational capacity by adding
or removing functions from the hypothesis space of solutions the learning algorithm
is able to choose. We gave the specific example of increasing or decreasing the
degree of a polynomial for a regression problem. The view we have described so
far is oversimplified.
The behavior of our algorithm is strongly affected not just by how large we
make the set of functions allowed in its hypothesis space, but by the specific identity
of those functions. The learning algorithm we have studied so far, linear regression,
has a hypothesis space consisting of the set of linear functions of its input. These
linear functions can be very useful for problems where the relationship between
inputs and outputs truly is close to linear. They are less useful for problems
that behave in a very nonlinear fashion. For example, linear regression would
not perform very well if we tried to use it to predict
sin
(
x
) from
x
. We can thus
control the performance of our algorithms by choosing what kind of functions we
allow them to draw solutions from, as well as by controlling the amount of these
functions.
We can also give a learning algorithm a preference for one solution in its
hypothesis space to another. This means that both functions are eligible, but one
is preferred. The unpreferred solution will be chosen only if it fits the training
118
CHAPTER 5. MACHINE LEARNING BASICS
data significantly better than the preferred solution.
For example, we can modify the training criterion for linear regression to include
weight decay
. To perform linear regression with weight decay, we minimize a sum
comprising both the mean squared error on the training and a criterion
J
(
w
) that
expresses a preference for the weights to have smaller squared
L
2
norm. Specifically,
J(w) = MSE
train
+ λw
w, (5.18)
where
λ
is a value chosen ahead of time that controls the strength of our preference
for smaller weights. When
λ
= 0, we impose no preference, and larger
λ
forces the
weights to become smaller. Minimizing
J
(
w
) results in a choice of weights that
make a tradeoff between fitting the training data and being small. This gives us
solutions that have a smaller slope, or put weight on fewer of the features. As an
example of how we can control a model’s tendency to overfit or underfit via weight
decay, we can train a high-degree polynomial regression model with different values
of λ. See figure 5.5 for the results.






Figure 5.5: We fit a high-degree polynomial regression model to our example training set
from figure 5.2. The true function is quadratic, but here we use only models with degree 9.
We vary the amount of weight decay to prevent these high-degree models from overfitting.
(Left)With very large
λ
, we can force the model to learn a function with no slope at
all. This underfits because it can only represent a constant function. (Center)With a
medium value of λ, the learning algorithm recovers a curve with the right general shape.
Even though the model is capable of representing functions with much more complicated
shape, weight decay has encouraged it to use a simpler function described by smaller
coefficients. (Right)With weight decay approaching zero (i.e., using the Moore-Penrose
pseudoinverse to solve the underdetermined problem with minimal regularization), the
degree-9 polynomial overfits significantly, as we saw in figure 5.2.
119
CHAPTER 5. MACHINE LEARNING BASICS
More generally, we can regularize a model that learns a function
f
(
x
;
θ
) by
adding a penalty called a
regularizer
to the cost function. In the case of weight
decay, the regularizer is Ω(
w
) =
w
w
. In chapter 7, we will see that many other
regularizers are possible.
Expressing preferences for one function over another is a more general way
of controlling a model’s capacity than including or excluding members from the
hypothesis space. We can think of excluding a function from a hypothesis space as
expressing an infinitely strong preference against that function.
In our weight decay example, we expressed our preference for linear functions
defined with smaller weights explicitly, via an extra term in the criterion we
minimize. There are many other ways of expressing preferences for different
solutions, both implicitly and explicitly. Together, these different approaches
are known as
regularization
. Regularization is any modification we make to a
learning algorithm that is intended to reduce its generalization error but not its
training error. Regularization is one of the central concerns of the field of machine
learning, rivaled in its importance only by optimization.
The no free lunch theorem has made it clear that there is no best machine
learning algorithm, and, in particular, no best form of regularization. Instead
we must choose a form of regularization that is well-suited to the particular task
we want to solve. The philosophy of deep learning in general and this book in
particular is that a very wide range of tasks (such as all of the intellectual tasks
that people can do) may all be solved effectively using very general-purpose forms
of regularization.
5.3 Hyperparameters and Validation Sets
Most machine learning algorithms have several settings that we can use to control
the behavior of the learning algorithm. These settings are called
hyperparame-
ters
. The values of hyperparameters are not adapted by the learning algorithm
itself (though we can design a nested learning procedure where one learning
algorithm learns the best hyperparameters for another learning algorithm).
In the polynomial regression example we saw in figure 5.2, there is a single
hyperparameter: the degree of the polynomial, which acts as a
capacity
hyper-
parameter. The
λ
value used to control the strength of weight decay is another
example of a hyperparameter.
Sometimes a setting is chosen to be a hyperparameter that the learning al-
gorithm does not learn because it is difficult to optimize. More frequently, the
120
CHAPTER 5. MACHINE LEARNING BASICS
setting must be a hyperparameter because it is not appropriate to learn that
hyperparameter on the training set. This applies to all hyperparameters that
control model capacity. If learned on the training set, such hyperparameters would
always choose the maximum possible model capacity, resulting in overfitting (refer
to figure 5.3). For example, we can always fit the training set better with a higher
degree polynomial and a weight decay setting of
λ
= 0 than we could with a lower
degree polynomial and a positive weight decay setting.
To solve this problem, we need a
validation set
of examples that the training
algorithm does not observe.
Earlier we discussed how a held-out test set, composed of examples coming from
the same distribution as the training set, can be used to estimate the generalization
error of a learner, after the learning process has completed. It is important that the
test examples are not used in any way to make choices about the model, including
its hyperparameters. For this reason, no example from the test set can be used
in the validation set. Therefore, we always construct the validation set from the
training data. Specifically, we split the training data into two disjoint subsets. One
of these subsets is used to learn the parameters. The other subset is our validation
set, used to estimate the generalization error during or after training, allowing
for the hyperparameters to be updated accordingly. The subset of data used to
learn the parameters is still typically called the training set, even though this
may be confused with the larger pool of data used for the entire training process.
The subset of data used to guide the selection of hyperparameters is called the
validation set. Typically, one uses about 80% of the training data for training and
20% for validation. Since the validation set is used to “train” the hyperparameters,
the validation set error will underestimate the generalization error, though typically
by a smaller amount than the training error. After all hyperparameter optimization
is complete, the generalization error may be estimated using the test set.
In practice, when the same test set has been used repeatedly to evaluate
performance of different algorithms over many years, and especially if we consider
all the attempts from the scientific community at beating the reported state-of-
the-art performance on that test set, we end up having optimistic evaluations with
the test set as well. Benchmarks can thus become stale and then do not reflect the
true field performance of a trained system. Thankfully, the community tends to
move on to new (and usually more ambitious and larger) benchmark datasets.
121
CHAPTER 5. MACHINE LEARNING BASICS
5.3.1 Cross-Validation
Dividing the dataset into a fixed training set and a fixed test set can be problematic
if it results in the test set being small. A small test set implies statistical uncertainty
around the estimated average test error, making it difficult to claim that algorithm
A works better than algorithm B on the given task.
When the dataset has hundreds of thousands of examples or more, this is not a
serious issue. When the dataset is too small, are alternative procedures enable one
to use all of the examples in the estimation of the mean test error, at the price of
increased computational cost. These procedures are based on the idea of repeating
the training and testing computation on different randomly chosen subsets or splits
of the original dataset. The most common of these is the
k
-fold cross-validation
procedure, shown in algorithm 5.1, in which a partition of the dataset is formed by
splitting it into
k
non-overlapping subsets. The test error may then be estimated
by taking the average test error across
k
trials. On trial
i
, the
i
-th subset of the
data is used as the test set and the rest of the data is used as the training set. One
problem is that there exist no unbiased estimators of the variance of such average
error estimators (Bengio and Grandvalet, 2004), but approximations are typically
used.
5.4 Estimators, Bias and Variance
The field of statistics gives us many tools that can be used to achieve the machine
learning goal of solving a task not only on the training set but also to generalize.
Foundational concepts such as parameter estimation, bias and variance are useful
to formally characterize notions of generalization, underfitting and overfitting.
5.4.1 Point Estimation
Point estimation is the attempt to provide the single “best” prediction of some
quantity of interest. In general the quantity of interest can be a single parameter
or a vector of parameters in some parametric model, such as the weights in our
linear regression example in section 5.1.4, but it can also be a whole function.
In order to distinguish estimates of parameters from their true value, our
convention will be to denote a point estimate of a parameter θ by
ˆ
θ.
Let
{x
(1)
, . . . , x
(m)
}
be a set of
m
independent and identically distributed
122
CHAPTER 5. MACHINE LEARNING BASICS
Algorithm 5.1
The
k
-fold cross-validation algorithm. It can be used to estimate
generalization error of a learning algorithm
A
when the given dataset
D
is too
small for a simple train/test or train/valid split to yield accurate estimation of
generalization error, because the mean of a loss
L
on a small test set may have too
high variance. The dataset
D
contains as elements the abstract examples
z
(i)
(for
the
i
-th example), which could stand for an (input,target) pair
z
(i)
= (
x
(i)
, y
(i)
)
in the case of supervised learning, or for just an input
z
(i)
=
x
(i)
in the case
of unsupervised learning. The algorithm returns the vector of errors
e
for each
example in
D
, whose mean is the estimated generalization error. The errors on
individual examples can be used to compute a confidence interval around the mean
(equation 5.47). While these confidence intervals are not well-justified after the
use of cross-validation, it is still common practice to use them to declare that
algorithm
A
is better than algorithm
B
only if the confidence interval of the error
of algorithm
A
lies below and does not intersect the confidence interval of algorithm
B.
Define KFoldXV(D, A, L, k):
Require: D, the given dataset, with elements z
(i)
Require: A
, the learning algorithm, seen as a function that takes a dataset as
input and outputs a learned function
Require: L
, the loss function, seen as a function from a learned function
f
and
an example z
(i)
D to a scalar R
Require: k, the number of folds
Split D into k mutually exclusive subsets D
i
, whose union is D.
for i from 1 to k do
f
i
= A(D\D
i
)
for z
(j)
in D
i
do
e
j
= L(f
i
, z
(j)
)
end for
end for
Return e
123
CHAPTER 5. MACHINE LEARNING BASICS
(i.i.d.) data points. A point estimator or statistic is any function of the data:
ˆ
θ
m
= g(x
(1)
, . . . , x
(m)
). (5.19)
The definition does not require that
g
return a value that is close to the true
θ
or even that the range of
g
is the same as the set of allowable values of
θ
.
This definition of a point estimator is very general and allows the designer of an
estimator great flexibility. While almost any function thus qualifies as an estimator,
a good estimator is a function whose output is close to the true underlying
θ
that
generated the training data.
For now, we take the frequentist perspective on statistics. That is, we assume
that the true parameter value
θ
is fixed but unknown, while the point estimate
ˆ
θ
is a function of the data. Since the data is drawn from a random process, any
function of the data is random. Therefore
ˆ
θ is a random variable.
Point estimation can also refer to the estimation of the relationship between
input and target variables. We refer to these types of point estimates as function
estimators.
Function Estimation
As we mentioned above, sometimes we are interested in
performing function estimation (or function approximation). Here we are trying to
predict a variable
y
given an input vector
x
. We assume that there is a function
f
(
x
) that describes the approximate relationship between
y
and
x
. For example,
we may assume that
y
=
f
(
x
) +
, where
stands for the part of
y
that is not
predictable from
x
. In function estimation, we are interested in approximating
f
with a model or estimate
ˆ
f
. Function estimation is really just the same as
estimating a parameter
θ
; the function estimator
ˆ
f
is simply a point estimator in
function space. The linear regression example (discussed above in section 5.1.4) and
the polynomial regression example (discussed in section 5.2) are both examples of
scenarios that may be interpreted either as estimating a parameter
w
or estimating
a function
ˆ
f mapping from x to y.
We now review the most commonly studied properties of point estimators and
discuss what they tell us about these estimators.
5.4.2 Bias
The bias of an estimator is defined as:
bias(
ˆ
θ
m
) = E(
ˆ
θ
m
) θ (5.20)
124
CHAPTER 5. MACHINE LEARNING BASICS
where the expectation is over the data (seen as samples from a random variable)
and
θ
is the true underlying value of
θ
used to define the data generating distri-
bution. An estimator
ˆ
θ
m
is said to be
unbiased
if
bias
(
ˆ
θ
m
) =
0
, which implies
that
E
(
ˆ
θ
m
) =
θ
. An estimator
ˆ
θ
m
is said to be
asymptotically unbiased
if
lim
m→∞
bias(
ˆ
θ
m
) = 0, which implies that lim
m→∞
E(
ˆ
θ
m
) = θ.
Example: Bernoulli Distribution
Consider a set of samples
{x
(1)
, . . . , x
(m)
}
that are independently and identically distributed according to a Bernoulli distri-
bution with mean θ:
P (x
(i)
; θ) = θ
x
(i)
(1 θ)
(1x
(i)
)
. (5.21)
A common estimator for the
θ
parameter of this distribution is the mean of the
training samples:
ˆ
θ
m
=
1
m
m
i=1
x
(i)
. (5.22)
To determine whether this estimator is biased, we can substitute equation 5.22
into equation 5.20:
bias(
ˆ
θ
m
) = E[
ˆ
θ
m
] θ (5.23)
= E
1
m
m
i=1
x
(i)
θ (5.24)
=
1
m
m
i=1
E
x
(i)
θ (5.25)
=
1
m
m
i=1
1
x
(i)
=0
x
(i)
θ
x
(i)
(1 θ)
(1x
(i)
)
θ (5.26)
=
1
m
m
i=1
(θ) θ (5.27)
= θ θ = 0 (5.28)
Since bias(
ˆ
θ) = 0, we say that our estimator
ˆ
θ is unbiased.
Example: Gaussian Distribution Estimator of the Mean
Now, consider
a set of samples
{x
(1)
, . . . , x
(m)
}
that are independently and identically distributed
according to a Gaussian distribution
p
(
x
(i)
) =
N
(
x
(i)
;
µ, σ
2
), where
i {
1
, . . . , m}
.
125
CHAPTER 5. MACHINE LEARNING BASICS
Recall that the Gaussian probability density function is given by
p(x
(i)
; µ, σ
2
) =
1
2πσ
2
exp
1
2
(x
(i)
µ)
2
σ
2
. (5.29)
A common estimator of the Gaussian mean parameter is known as the
sample
mean:
ˆµ
m
=
1
m
m
i=1
x
(i)
(5.30)
To determine the bias of the sample mean, we are again interested in calculating
its expectation:
bias(ˆµ
m
) = E[ˆµ
m
] µ (5.31)
= E
1
m
m
i=1
x
(i)
µ (5.32)
=
1
m
m
i=1
E
x
(i)
µ (5.33)
=
1
m
m
i=1
µ
µ (5.34)
= µ µ = 0 (5.35)
Thus we find that the sample mean is an unbiased estimator of Gaussian mean
parameter.
Example: Estimators of the Variance of a Gaussian Distribution
As an
example, we compare two different estimators of the variance parameter
σ
2
of a
Gaussian distribution. We are interested in knowing if either estimator is biased.
The first estimator of σ
2
we consider is known as the sample variance:
ˆσ
2
m
=
1
m
m
i=1
x
(i)
ˆµ
m
2
, (5.36)
where
ˆµ
m
is the sample mean, defined above. More formally, we are interested in
computing
bias(ˆσ
2
m
) = E[ˆσ
2
m
] σ
2
(5.37)
126
CHAPTER 5. MACHINE LEARNING BASICS
We begin by evaluating the term E[ˆσ
2
m
]:
E[ˆσ
2
m
] =E
1
m
m
i=1
x
(i)
ˆµ
m
2
(5.38)
=
m 1
m
σ
2
(5.39)
Returning to equation 5.37, we conclude that the bias of
ˆσ
2
m
is
σ
2
/m
. Therefore,
the sample variance is a biased estimator.
The unbiased sample variance estimator
˜σ
2
m
=
1
m 1
m
i=1
x
(i)
ˆµ
m
2
(5.40)
provides an alternative approach. As the name suggests this estimator is unbiased.
That is, we find that E[˜σ
2
m
] = σ
2
:
E[˜σ
2
m
] = E
1
m 1
m
i=1
x
(i)
ˆµ
m
2
(5.41)
=
m
m 1
E[ˆσ
2
m
] (5.42)
=
m
m 1
m 1
m
σ
2
(5.43)
= σ
2
. (5.44)
We have two estimators: one is biased and the other is not. While unbiased
estimators are clearly desirable, they are not always the “best” estimators. As we
will see we often use biased estimators that possess other important properties.
5.4.3 Variance and Standard Error
Another property of the estimator that we might want to consider is how much
we expect it to vary as a function of the data sample. Just as we computed the
expectation of the estimator to determine its bias, we can compute its variance.
The variance of an estimator is simply the variance
Var(
ˆ
θ) (5.45)
where the random variable is the training set. Alternately, the square root of the
variance is called the standard error, denoted SE(
ˆ
θ).
127
CHAPTER 5. MACHINE LEARNING BASICS
The variance or the standard error of an estimator provides a measure of how
we would expect the estimate we compute from data to vary as we independently
resample the dataset from the underlying data generating process. Just as we
might like an estimator to exhibit low bias we would also like it to have relatively
low variance.
When we compute any statistic using a finite number of samples, our estimate
of the true underlying parameter is uncertain, in the sense that we could have
obtained other samples from the same distribution and their statistics would have
been different. The expected degree of variation in any estimator is a source of
error that we want to quantify.
The standard error of the mean is given by
SE(ˆµ
m
) =
Var
1
m
m
i=1
x
(i)
=
σ
m
, (5.46)
where
σ
2
is the true variance of the samples
x
i
. The standard error is often
estimated by using an estimate of
σ
. Unfortunately, neither the square root of
the sample variance nor the square root of the unbiased estimator of the variance
provide an unbiased estimate of the standard deviation. Both approaches tend
to underestimate the true standard deviation, but are still used in practice. The
square root of the unbiased estimator of the variance is less of an underestimate.
For large m, the approximation is quite reasonable.
The standard error of the mean is very useful in machine learning experiments.
We often estimate the generalization error by computing the sample mean of the
error on the test set. The number of examples in the test set determines the
accuracy of this estimate. Taking advantage of the central limit theorem, which
tells us that the mean will be approximately distributed with a normal distribution,
we can use the standard error to compute the probability that the true expectation
falls in any chosen interval. For example, the 95% confidence interval centered on
the mean ˆµ
m
is
(ˆµ
m
1.96SE(ˆµ
m
), ˆµ
m
+ 1.96SE(ˆµ
m
)), (5.47)
under the normal distribution with mean
ˆµ
m
and variance
SE
(
ˆµ
m
)
2
. In machine
learning experiments, it is common to say that algorithm
A
is better than algorithm
B
if the upper bound of the 95% confidence interval for the error of algorithm
A
is
less than the lower bound of the 95% confidence interval for the error of algorithm
B.
128
CHAPTER 5. MACHINE LEARNING BASICS
Example: Bernoulli Distribution
We once again consider a set of samples
{x
(1)
, . . . , x
(m)
}
drawn independently and identically from a Bernoulli distribution
(recall
P
(
x
(i)
;
θ
) =
θ
x
(i)
(1
θ
)
(1x
(i)
)
). This time we are interested in computing
the variance of the estimator
ˆ
θ
m
=
1
m
m
i=1
x
(i)
.
Var
ˆ
θ
m
= Var
1
m
m
i=1
x
(i)
(5.48)
=
1
m
2
m
i=1
Var
x
(i)
(5.49)
=
1
m
2
m
i=1
θ(1 θ) (5.50)
=
1
m
2
(1 θ) (5.51)
=
1
m
θ(1 θ) (5.52)
The variance of the estimator decreases as a function of
m
, the number of examples
in the dataset. This is a common property of popular estimators that we will
return to when we discuss consistency (see section 5.4.5).
5.4.4 Trading off Bias and Variance to Minimize Mean Squared
Error
Bias and variance measure two different sources of error in an estimator. Bias
measures the expected deviation from the true value of the function or parameter.
Variance on the other hand, provides a measure of the deviation from the expected
estimator value that any particular sampling of the data is likely to cause.
What happens when we are given a choice between two estimators, one with
more bias and one with more variance? How do we choose between them? For
example, imagine that we are interested in approximating the function shown in
figure 5.2 and we are only offered the choice between a model with large bias and
one that suffers from large variance. How do we choose between them?
The most common way to negotiate this trade-off is to use cross-validation.
Empirically, cross-validation is highly successful on many real-world tasks. Alter-
natively, we can also compare the
mean squared error
(MSE) of the estimates:
MSE = E[(
ˆ
θ
m
θ)
2
] (5.53)
= Bias(
ˆ
θ
m
)
2
+ Var(
ˆ
θ
m
) (5.54)
129
CHAPTER 5. MACHINE LEARNING BASICS
The MSE measures the overall expected deviation—in a squared error sense—
between the estimator and the true value of the parameter
θ
. As is clear from
equation 5.54, evaluating the MSE incorporates both the bias and the variance.
Desirable estimators are those with small MSE and these are estimators that
manage to keep both their bias and variance somewhat in check.
Capacity
Bias
Generalization
error
Variance
Optimal
capacity
Overfitting zoneUnderfitting zone
Figure 5.6: As capacity increases (
x
-axis), bias (dotted) tends to decrease and variance
(dashed) tends to increase, yielding another U-shaped curve for generalization error (bold
curve). If we vary capacity along one axis, there is an optimal capacity, with underfitting
when the capacity is below this optimum and overfitting when it is above. This relationship
is similar to the relationship between capacity, underfitting, and overfitting, discussed in
section 5.2 and figure 5.3.
The relationship between bias and variance is tightly linked to the machine
learning concepts of capacity, underfitting and overfitting. In the case where gen-
eralization error is measured by the MSE (where bias and variance are meaningful
components of generalization error), increasing capacity tends to increase variance
and decrease bias. This is illustrated in figure 5.6, where we see again the U-shaped
curve of generalization error as a function of capacity.
5.4.5 Consistency
So far we have discussed the properties of various estimators for a training set of
fixed size. Usually, we are also concerned with the behavior of an estimator as the
amount of training data grows. In particular, we usually wish that, as the number
of data points
m
in our dataset increases, our point estimates converge to the true
130
CHAPTER 5. MACHINE LEARNING BASICS
value of the corresponding parameters. More formally, we would like that
plim
m→∞
ˆ
θ
m
= θ. (5.55)
The symbol
plim
indicates convergence in probability, meaning that for any
>
0,
P
(
|
ˆ
θ
m
θ| >
)
0 as
m
. The condition described by equation 5.55 is
known as
consistency
. It is sometimes referred to as weak consistency, with
strong consistency referring to the
almost sure
convergence of
ˆ
θ
to
θ
.
Almost
sure convergence
of a sequence of random variables
x
(1)
, x
(2)
, . . .
to a value
x
occurs when p(lim
m→∞
x
(m)
= x) = 1.
Consistency ensures that the bias induced by the estimator diminishes as the
number of data examples grows. However, the reverse is not true—asymptotic
unbiasedness does not imply consistency. For example, consider estimating the
mean parameter
µ
of a normal distribution
N
(
x
;
µ, σ
2
), with a dataset consisting
of
m
samples:
{x
(1)
, . . . , x
(m)
}
. We could use the first sample
x
(1)
of the dataset
as an unbiased estimator:
ˆ
θ
=
x
(1)
. In that case,
E
(
ˆ
θ
m
) =
θ
so the estimator
is unbiased no matter how many data points are seen. This, of course, implies
that the estimate is asymptotically unbiased. However, this is not a consistent
estimator as it is not the case that
ˆ
θ
m
θ as m .
5.5 Maximum Likelihood Estimation
Previously, we have seen some definitions of common estimators and analyzed
their properties. But where did these estimators come from? Rather than guessing
that some function might make a good estimator and then analyzing its bias and
variance, we would like to have some principle from which we can derive specific
functions that are good estimators for different models.
The most common such principle is the maximum likelihood principle.
Consider a set of
m
examples
X
=
{x
(1)
, . . . , x
(m)
}
drawn independently from
the true but unknown data generating distribution p
data
(x).
Let
p
model
(
x
;
θ
) be a parametric family of probability distributions over the
same space indexed by
θ
. In other words,
p
model
(
x
;
θ
) maps any configuration
x
to a real number estimating the true probability p
data
(x).
The maximum likelihood estimator for θ is then defined as
θ
ML
= arg max
θ
p
model
(X; θ) (5.56)
= arg max
θ
m
i=1
p
model
(x
(i)
; θ) (5.57)
131
CHAPTER 5. MACHINE LEARNING BASICS
This product over many probabilities can be inconvenient for a variety of reasons.
For example, it is prone to numerical underflow. To obtain a more convenient
but equivalent optimization problem, we observe that taking the logarithm of the
likelihood does not change its
arg max
but does conveniently transform a product
into a sum:
θ
ML
= arg max
θ
m
i=1
log p
model
(x
(i)
; θ). (5.58)
Because the
arg max
does not change when we rescale the cost function, we can
divide by
m
to obtain a version of the criterion that is expressed as an expectation
with respect to the empirical distribution ˆp
data
defined by the training data:
θ
ML
= arg max
θ
E
xˆp
data
log p
model
(x; θ). (5.59)
One way to interpret maximum likelihood estimation is to view it as minimizing
the dissimilarity between the empirical distribution
ˆp
data
defined by the training
set and the model distribution, with the degree of dissimilarity between the two
measured by the KL divergence. The KL divergence is given by
D
KL
(ˆp
data
p
model
) = E
xˆp
data
[log ˆp
data
(x) log p
model
(x)] . (5.60)
The term on the left is a function only of the data generating process, not the
model. This means when we train the model to minimize the KL divergence, we
need only minimize
E
xˆp
data
[log p
model
(x)] (5.61)
which is of course the same as the maximization in equation 5.59.
Minimizing this KL divergence corresponds exactly to minimizing the cross-
entropy between the distributions. Many authors use the term “cross-entropy” to
identify specifically the negative log-likelihood of a Bernoulli or softmax distribution,
but that is a misnomer. Any loss consisting of a negative log-likelihood is a cross-
entropy between the empirical distribution defined by the training set and the
probability distribution defined by model. For example, mean squared error is the
cross-entropy between the empirical distribution and a Gaussian model.
We can thus see maximum likelihood as an attempt to make the model dis-
tribution match the empirical distribution
ˆp
data
. Ideally, we would like to match
the true data generating distribution
p
data
, but we have no direct access to this
distribution.
While the optimal
θ
is the same regardless of whether we are maximizing the
likelihood or minimizing the KL divergence, the values of the objective functions
132
CHAPTER 5. MACHINE LEARNING BASICS
are different. In software, we often phrase both as minimizing a cost function.
Maximum likelihood thus becomes minimization of the negative log-likelihood
(NLL), or equivalently, minimization of the cross entropy. The perspective of
maximum likelihood as minimum KL divergence becomes helpful in this case
because the KL divergence has a known minimum value of zero. The negative
log-likelihood can actually become negative when x is real-valued.
5.5.1 Conditional Log-Likelihood and Mean Squared Error
The maximum likelihood estimator can readily be generalized to the case where
our goal is to estimate a conditional probability
P
(
y | x
;
θ
) in order to predict
y
given
x
. This is actually the most common situation because it forms the basis for
most supervised learning. If
X
represents all our inputs and
Y
all our observed
targets, then the conditional maximum likelihood estimator is
θ
ML
= arg max
θ
P (Y | X; θ). (5.62)
If the examples are assumed to be i.i.d., then this can be decomposed into
θ
ML
= arg max
θ
m
i=1
log P (y
(i)
| x
(i)
; θ). (5.63)
Example: Linear Regression as Maximum Likelihood
Linear regression,
introduced earlier in section 5.1.4, may be justified as a maximum likelihood
procedure. Previously, we motivated linear regression as an algorithm that learns
to take an input
x
and produce an output value
ˆy
. The mapping from
x
to
ˆy
is
chosen to minimize mean squared error, a criterion that we introduced more or less
arbitrarily. We now revisit linear regression from the point of view of maximum
likelihood estimation. Instead of producing a single prediction
ˆy
, we now think
of the model as producing a conditional distribution
p
(
y | x
). We can imagine
that with an infinitely large training set, we might see several training examples
with the same input value
x
but different values of
y
. The goal of the learning
algorithm is now to fit the distribution
p
(
y | x
) to all of those different
y
values
that are all compatible with
x
. To derive the same linear regression algorithm
we obtained before, we define
p
(
y | x
) =
N
(
y
;
ˆy
(
x
;
w
)
, σ
2
). The function
ˆy
(
x
;
w
)
gives the prediction of the mean of the Gaussian. In this example, we assume that
the variance is fixed to some constant
σ
2
chosen by the user. We will see that this
choice of the functional form of
p
(
y | x
) causes the maximum likelihood estimation
procedure to yield the same learning algorithm as we developed before. Since the
133
CHAPTER 5. MACHINE LEARNING BASICS
examples are assumed to be i.i.d., the conditional log-likelihood (equation 5.63) is
given by
m
i=1
log p(y
(i)
| x
(i)
; θ) (5.64)
= m log σ
m
2
log(2π)
m
i=1
ˆy
(i)
y
(i)
2
2σ
2
, (5.65)
where
ˆy
(i)
is the output of the linear regression on the
i
-th input
x
(i)
and
m
is the
number of the training examples. Comparing the log-likelihood with the mean
squared error,
MSE
train
=
1
m
m
i=1
||ˆy
(i)
y
(i)
||
2
, (5.66)
we immediately see that maximizing the log-likelihood with respect to
w
yields
the same estimate of the parameters
w
as does minimizing the mean squared error.
The two criteria have different values but the same location of the optimum. This
justifies the use of the MSE as a maximum likelihood estimation procedure. As we
will see, the maximum likelihood estimator has several desirable properties.
5.5.2 Properties of Maximum Likelihood
The main appeal of the maximum likelihood estimator is that it can be shown to
be the best estimator asymptotically, as the number of examples
m
, in terms
of its rate of convergence as m increases.
Under appropriate conditions, the maximum likelihood estimator has the
property of consistency (see section 5.4.5 above), meaning that as the number
of training examples approaches infinity, the maximum likelihood estimate of a
parameter converges to the true value of the parameter. These conditions are:
The true distribution
p
data
must lie within the model family
p
model
(
·
;
θ
).
Otherwise, no estimator can recover p
data
.
The true distribution
p
data
must correspond to exactly one value of
θ
. Other-
wise, maximum likelihood can recover the correct
p
data
, but will not be able
to determine which value of θ was used by the data generating processing.
There are other inductive principles besides the maximum likelihood estima-
tor, many of which share the property of being consistent estimators. However,
134
CHAPTER 5. MACHINE LEARNING BASICS
consistent estimators can differ in their
statistic efficiency
, meaning that one
consistent estimator may obtain lower generalization error for a fixed number of
samples
m
, or equivalently, may require fewer examples to obtain a fixed level of
generalization error.
Statistical efficiency is typically studied in the
parametric case
(like in linear
regression) where our goal is to estimate the value of a parameter (and assuming
it is possible to identify the true parameter), not the value of a function. A way to
measure how close we are to the true parameter is by the expected mean squared
error, computing the squared difference between the estimated and true parameter
values, where the expectation is over
m
training samples from the data generating
distribution. That parametric mean squared error decreases as
m
increases, and
for
m
large, the Cramér-Rao lower bound (Rao, 1945; Cramér, 1946) shows that no
consistent estimator has a lower mean squared error than the maximum likelihood
estimator.
For these reasons (consistency and efficiency), maximum likelihood is often
considered the preferred estimator to use for machine learning. When the number
of examples is small enough to yield overfitting behavior, regularization strategies
such as weight decay may be used to obtain a biased version of maximum likelihood
that has less variance when training data is limited.
5.6 Bayesian Statistics
So far we have discussed
frequentist statistics
and approaches based on estimat-
ing a single value of
θ
, then making all predictions thereafter based on that one
estimate. Another approach is to consider all possible values of
θ
when making a
prediction. The latter is the domain of Bayesian statistics.
As discussed in section 5.4.1, the frequentist perspective is that the true
parameter value
θ
is fixed but unknown, while the point estimate
ˆ
θ
is a random
variable on account of it being a function of the dataset (which is seen as random).
The Bayesian perspective on statistics is quite different. The Bayesian uses
probability to reflect degrees of certainty of states of knowledge. The dataset is
directly observed and so is not random. On the other hand, the true parameter
θ
is unknown or uncertain and thus is represented as a random variable.
Before observing the data, we represent our knowledge of
θ
using the
prior
probability distribution
,
p
(
θ
) (sometimes referred to as simply “the prior”).
Generally, the machine learning practitioner selects a prior distribution that is
quite broad (i.e. with high entropy) to reflect a high degree of uncertainty in the
135
CHAPTER 5. MACHINE LEARNING BASICS
value of
θ
before observing any data. For example, one might assume a priori that
θ
lies in some finite range or volume, with a uniform distribution. Many priors
instead reflect a preference for “simpler” solutions (such as smaller magnitude
coefficients, or a function that is closer to being constant).
Now consider that we have a set of data samples
{x
(1)
, . . . , x
(m)
}
. We can
recover the effect of data on our belief about
θ
by combining the data likelihood
p(x
(1)
, . . . , x
(m)
| θ) with the prior via Bayes’ rule:
p(θ | x
(1)
, . . . , x
(m)
) =
p(x
(1)
, . . . , x
(m)
| θ)p(θ)
p(x
(1)
, . . . , x
(m)
)
(5.67)
In the scenarios where Bayesian estimation is typically used, the prior begins as a
relatively uniform or Gaussian distribution with high entropy, and the observation
of the data usually causes the posterior to lose entropy and concentrate around a
few highly likely values of the parameters.
Relative to maximum likelihood estimation, Bayesian estimation offers two
important differences. First, unlike the maximum likelihood approach that makes
predictions using a point estimate of
θ
, the Bayesian approach is to make predictions
using a full distribution over
θ
. For example, after observing
m
examples, the
predicted distribution over the next data sample, x
(m+1)
, is given by
p(x
(m+1)
| x
(1)
, . . . , x
(m)
) =
p(x
(m+1)
| θ)p(θ | x
(1)
, . . . , x
(m)
) dθ. (5.68)
Here each value of
θ
with positive probability density contributes to the prediction
of the next example, with the contribution weighted by the posterior density itself.
After having observed
{x
(1)
, . . . , x
(m)
}
, if we are still quite uncertain about the
value of
θ
, then this uncertainty is incorporated directly into any predictions we
might make.
In section 5.4, we discussed how the frequentist approach addresses the uncer-
tainty in a given point estimate of
θ
by evaluating its variance. The variance of
the estimator is an assessment of how the estimate might change with alternative
samplings of the observed data. The Bayesian answer to the question of how to deal
with the uncertainty in the estimator is to simply integrate over it, which tends to
protect well against overfitting. This integral is of course just an application of
the laws of probability, making the Bayesian approach simple to justify, while the
frequentist machinery for constructing an estimator is based on the rather ad hoc
decision to summarize all knowledge contained in the dataset with a single point
estimate.
The second important difference between the Bayesian approach to estimation
and the maximum likelihood approach is due to the contribution of the Bayesian
136
CHAPTER 5. MACHINE LEARNING BASICS
prior distribution. The prior has an influence by shifting probability mass density
towards regions of the parameter space that are preferred a priori. In practice,
the prior often expresses a preference for models that are simpler or more smooth.
Critics of the Bayesian approach identify the prior as a source of subjective human
judgment impacting the predictions.
Bayesian methods typically generalize much better when limited training data
is available, but typically suffer from high computational cost when the number of
training examples is large.
Example: Bayesian Linear Regression
Here we consider the Bayesian esti-
mation approach to learning the linear regression parameters. In linear regression,
we learn a linear mapping from an input vector
x R
n
to predict the value of a
scalar y R. The prediction is parametrized by the vector w R
n
:
ˆy = w
x. (5.69)
Given a set of
m
training samples (
X
(train)
, y
(train)
), we can express the prediction
of y over the entire training set as:
ˆ
y
(train)
= X
(train)
w. (5.70)
Expressed as a Gaussian conditional distribution on y
(train)
, we have
p(y
(train)
| X
(train)
, w) = N(y
(train)
; X
(train)
w, I) (5.71)
exp
1
2
(y
(train)
X
(train)
w)
(y
(train)
X
(train)
w)
,
(5.72)
where we follow the standard MSE formulation in assuming that the Gaussian
variance on
y
is one. In what follows, to reduce the notational burden, we refer to
(X
(train)
, y
(train)
) as simply (X, y).
To determine the posterior distribution over the model parameter vector
w
, we
first need to specify a prior distribution. The prior should reflect our naive belief
about the value of these parameters. While it is sometimes difficult or unnatural
to express our prior beliefs in terms of the parameters of the model, in practice we
typically assume a fairly broad distribution expressing a high degree of uncertainty
about
θ
. For real-valued parameters it is common to use a Gaussian as a prior
distribution:
p(w) = N(w; µ
0
, Λ
0
) exp
1
2
(w µ
0
)
Λ
1
0
(w µ
0
)
, (5.73)
137
CHAPTER 5. MACHINE LEARNING BASICS
where
µ
0
and
Λ
0
are the prior distribution mean vector and covariance matrix
respectively.
1
With the prior thus specified, we can now proceed in determining the
posterior
distribution over the model parameters.
p(w | X, y) p(y | X, w)p(w) (5.74)
exp
1
2
(y Xw)
(y Xw)
exp
1
2
(w µ
0
)
Λ
1
0
(w µ
0
)
(5.75)
exp
1
2
2y
Xw + w
X
Xw + w
Λ
1
0
w 2µ
0
Λ
1
0
w
.
(5.76)
We now define
Λ
m
=
X
X + Λ
1
0
1
and
µ
m
= Λ
m
X
y + Λ
1
0
µ
0
. Using
these new variables, we find that the posterior may be rewritten as a Gaussian
distribution:
p(w | X, y) exp
1
2
(w µ
m
)
Λ
1
m
(w µ
m
) +
1
2
µ
m
Λ
1
m
µ
m
(5.77)
exp
1
2
(w µ
m
)
Λ
1
m
(w µ
m
)
. (5.78)
All terms that do not include the parameter vector
w
have been omitted; they
are implied by the fact that the distribution must be normalized to integrate to 1.
Equation 3.23 shows how to normalize a multivariate Gaussian distribution.
Examining this posterior distribution allows us to gain some intuition for the
effect of Bayesian inference. In most situations, we set
µ
0
to
0
. If we set
Λ
0
=
1
α
I
,
then
µ
m
gives the same estimate of
w
as does frequentist linear regression with a
weight decay penalty of
αw
w
. One difference is that the Bayesian estimate is
undefined if
α
is set to zero—-we are not allowed to begin the Bayesian learning
process with an infinitely wide prior on
w
. The more important difference is that
the Bayesian estimate provides a covariance matrix, showing how likely all the
different values of w are, rather than providing only the estimate µ
m
.
5.6.1 Maximum A Posteriori (MAP) Estimation
While the most principled approach is to make predictions using the full Bayesian
posterior distribution over the parameter
θ
, it is still often desirable to have a
1
Unless there is a reason to assume a particular covariance structure, we typically assume a
diagonal covariance matrix Λ
0
= diag(λ
0
).
138
CHAPTER 5. MACHINE LEARNING BASICS
single point estimate. One common reason for desiring a point estimate is that
most operations involving the Bayesian posterior for most interesting models are
intractable, and a point estimate offers a tractable approximation. Rather than
simply returning to the maximum likelihood estimate, we can still gain some of
the benefit of the Bayesian approach by allowing the prior to influence the choice
of the point estimate. One rational way to do this is to choose the
maximum
a posteriori
(MAP) point estimate. The MAP estimate chooses the point of
maximal posterior probability (or maximal probability density in the more common
case of continuous θ):
θ
MAP
= arg max
θ
p(θ | x) = arg max
θ
log p(x | θ) + log p(θ). (5.79)
We recognize, above on the right hand side,
log p
(
x | θ
), i.e. the standard log-
likelihood term, and log p(θ), corresponding to the prior distribution.
As an example, consider a linear regression model with a Gaussian prior on
the weights
w
. If this prior is given by
N
(
w
;
0,
1
λ
I
2
), then the log-prior term in
equation 5.79 is proportional to the familiar
λw
w
weight decay penalty, plus a
term that does not depend on
w
and does not affect the learning process. MAP
Bayesian inference with a Gaussian prior on the weights thus corresponds to weight
decay.
As with full Bayesian inference, MAP Bayesian inference has the advantage of
leveraging information that is brought by the prior and cannot be found in the
training data. This additional information helps to reduce the variance in the
MAP point estimate (in comparison to the ML estimate). However, it does so at
the price of increased bias.
Many regularized estimation strategies, such as maximum likelihood learning
regularized with weight decay, can be interpreted as making the MAP approxima-
tion to Bayesian inference. This view applies when the regularization consists of
adding an extra term to the objective function that corresponds to
log p
(
θ
). Not
all regularization penalties correspond to MAP Bayesian inference. For example,
some regularizer terms may not be the logarithm of a probability distribution.
Other regularization terms depend on the data, which of course a prior probability
distribution is not allowed to do.
MAP Bayesian inference provides a straightforward way to design complicated
yet interpretable regularization terms. For example, a more complicated penalty
term can be derived by using a mixture of Gaussians, rather than a single Gaussian
distribution, as the prior (Nowlan and Hinton, 1992).
139
CHAPTER 5. MACHINE LEARNING BASICS
5.7 Supervised Learning Algorithms
Recall from section 5.1.3 that supervised learning algorithms are, roughly speaking,
learning algorithms that learn to associate some input with some output, given a
training set of examples of inputs
x
and outputs
y
. In many cases the outputs
y
may be difficult to collect automatically and must be provided by a human
“supervisor,” but the term still applies even when the training set targets were
collected automatically.
5.7.1 Probabilistic Supervised Learning
Most supervised learning algorithms in this book are based on estimating a
probability distribution
p
(
y | x
). We can do this simply by using maximum
likelihood estimation to find the best parameter vector
θ
for a parametric family
of distributions p(y | x; θ).
We have already seen that linear regression corresponds to the family
p(y | x; θ) = N(y; θ
x, I). (5.80)
We can generalize linear regression to the classification scenario by defining a
different family of probability distributions. If we have two classes, class 0 and
class 1, then we need only specify the probability of one of these classes. The
probability of class 1 determines the probability of class 0, because these two values
must add up to 1.
The normal distribution over real-valued numbers that we used for linear
regression is parametrized in terms of a mean. Any value we supply for this mean
is valid. A distribution over a binary variable is slightly more complicated, because
its mean must always be between 0 and 1. One way to solve this problem is to use
the logistic sigmoid function to squash the output of the linear function into the
interval (0, 1) and interpret that value as a probability:
p(y = 1 | x; θ) = σ(θ
x). (5.81)
This approach is known as
logistic regression
(a somewhat strange name since
we use the model for classification rather than regression).
In the case of linear regression, we were able to find the optimal weights by
solving the normal equations. Logistic regression is somewhat more difficult. There
is no closed-form solution for its optimal weights. Instead, we must search for
them by maximizing the log-likelihood. We can do this by minimizing the negative
log-likelihood (NLL) using gradient descent.
140
CHAPTER 5. MACHINE LEARNING BASICS
This same strategy can be applied to essentially any supervised learning problem,
by writing down a parametric family of conditional probability distributions over
the right kind of input and output variables.
5.7.2 Support Vector Machines
One of the most influential approaches to supervised learning is the support vector
machine (Boser et al., 1992; Cortes and Vapnik, 1995). This model is similar to
logistic regression in that it is driven by a linear function
w
x
+
b
. Unlike logistic
regression, the support vector machine does not provide probabilities, but only
outputs a class identity. The SVM predicts that the positive class is present when
w
x
+
b
is positive. Likewise, it predicts that the negative class is present when
w
x + b is negative.
One key innovation associated with support vector machines is the
kernel
trick
. The kernel trick consists of observing that many machine learning algorithms
can be written exclusively in terms of dot products between examples. For example,
it can be shown that the linear function used by the support vector machine can
be re-written as
w
x + b = b +
m
i=1
α
i
x
x
(i)
(5.82)
where
x
(i)
is a training example and
α
is a vector of coefficients. Rewriting the
learning algorithm this way allows us to replace
x
by the output of a given feature
function
φ
(
x
) and the dot product with a function
k
(
x, x
(i)
) =
φ
(
x
)
·φ
(
x
(i)
) called
a
kernel
. The
·
operator represents an inner product analogous to
φ
(
x
)
φ
(
x
(i)
).
For some feature spaces, we may not use literally the vector inner product. In
some infinite dimensional spaces, we need to use other kinds of inner products, for
example, inner products based on integration rather than summation. A complete
development of these kinds of inner products is beyond the scope of this book.
After replacing dot products with kernel evaluations, we can make predictions
using the function
f(x) = b +
i
α
i
k(x, x
(i)
). (5.83)
This function is nonlinear with respect to
x
, but the relationship between
φ
(
x
)
and
f
(
x
) is linear. Also, the relationship between
α
and
f
(
x
) is linear. The
kernel-based function is exactly equivalent to preprocessing the data by applying
φ(x) to all inputs, then learning a linear model in the new transformed space.
The kernel trick is powerful for two reasons. First, it allows us to learn models
that are nonlinear as a function of
x
using convex optimization techniques that are
141
CHAPTER 5. MACHINE LEARNING BASICS
guaranteed to converge efficiently. This is possible because we consider
φ
fixed and
optimize only
α
, i.e., the optimization algorithm can view the decision function
as being linear in a different space. Second, the kernel function
k
often admits
an implementation that is significantly more computational efficient than naively
constructing two φ(x) vectors and explicitly taking their dot product.
In some cases,
φ
(
x
) can even be infinite dimensional, which would result in
an infinite computational cost for the naive, explicit approach. In many cases,
k
(
x, x
) is a nonlinear, tractable function of
x
even when
φ
(
x
) is intractable. As
an example of an infinite-dimensional feature space with a tractable kernel, we
construct a feature mapping
φ
(
x
) over the non-negative integers
x
. Suppose that
this mapping returns a vector containing
x
ones followed by infinitely many zeros.
We can write a kernel function
k
(
x, x
(i)
) =
min
(
x, x
(i)
) that is exactly equivalent
to the corresponding infinite-dimensional dot product.
The most commonly used kernel is the Gaussian kernel
k(u, v) = N(u v; 0, σ
2
I) (5.84)
where
N
(
x
;
µ, Σ
) is the standard normal density. This kernel is also known as
the
radial basis function
(RBF) kernel, because its value decreases along lines
in
v
space radiating outward from
u
. The Gaussian kernel corresponds to a dot
product in an infinite-dimensional space, but the derivation of this space is less
straightforward than in our example of the min kernel over the integers.
We can think of the Gaussian kernel as performing a kind of
template match-
ing
. A training example
x
associated with training label
y
becomes a template
for class
y
. When a test point
x
is near
x
according to Euclidean distance, the
Gaussian kernel has a large response, indicating that
x
is very similar to the
x
template. The model then puts a large weight on the associated training label
y
.
Overall, the prediction will combine many such training labels weighted by the
similarity of the corresponding training examples.
Support vector machines are not the only algorithm that can be enhanced
using the kernel trick. Many other linear models can be enhanced in this way. The
category of algorithms that employ the kernel trick is known as
kernel machines
or kernel methods (Williams and Rasmussen, 1996; Schölkopf et al., 1999).
A major drawback to kernel machines is that the cost of evaluating the decision
function is linear in the number of training examples, because the
i
-th example
contributes a term
α
i
k
(
x, x
(i)
) to the decision function. Support vector machines
are able to mitigate this by learning an
α
vector that contains mostly zeros.
Classifying a new example then requires evaluating the kernel function only for
the training examples that have non-zero
α
i
. These training examples are known
142
CHAPTER 5. MACHINE LEARNING BASICS
as support vectors.
Kernel machines also suffer from a high computational cost of training when
the dataset is large. We will revisit this idea in section 5.9. Kernel machines with
generic kernels struggle to generalize well. We will explain why in section 5.11. The
modern incarnation of deep learning was designed to overcome these limitations of
kernel machines. The current deep learning renaissance began when Hinton et al.
(2006) demonstrated that a neural network could outperform the RBF kernel SVM
on the MNIST benchmark.
5.7.3 Other Simple Supervised Learning Algorithms
We have already briefly encountered another non-probabilistic supervised learning
algorithm, nearest neighbor regression. More generally,
k
-nearest neighbors is
a family of techniques that can be used for classification or regression. As a
non-parametric learning algorithm,
k
-nearest neighbors is not restricted to a fixed
number of parameters. We usually think of the
k
-nearest neighbors algorithm
as not having any parameters, but rather implementing a simple function of the
training data. In fact, there is not even really a training stage or learning process.
Instead, at test time, when we want to produce an output
y
for a new test input
x
,
we find the
k
-nearest neighbors to
x
in the training data
X
. We then return the
average of the corresponding
y
values in the training set. This works for essentially
any kind of supervised learning where we can define an average over
y
values. In
the case of classification, we can average over one-hot code vectors
c
with
c
y
= 1
and
c
i
= 0 for all other values of
i
. We can then interpret the average over these
one-hot codes as giving a probability distribution over classes. As a non-parametric
learning algorithm,
k
-nearest neighbor can achieve very high capacity. For example,
suppose we have a multiclass classification task and measure performance with 0-1
loss. In this setting, 1-nearest neighbor converges to double the Bayes error as the
number of training examples approaches infinity. The error in excess of the Bayes
error results from choosing a single neighbor by breaking ties between equally
distant neighbors randomly. When there is infinite training data, all test points
x
will have infinitely many training set neighbors at distance zero. If we allow the
algorithm to use all of these neighbors to vote, rather than randomly choosing one
of them, the procedure converges to the Bayes error rate. The high capacity of
k
-nearest neighbors allows it to obtain high accuracy given a large training set.
However, it does so at high computational cost, and it may generalize very badly
given a small, finite training set. One weakness of
k
-nearest neighbors is that it
cannot learn that one feature is more discriminative than another. For example,
imagine we have a regression task with
x R
100
drawn from an isotropic Gaussian
143
CHAPTER 5. MACHINE LEARNING BASICS
distribution, but only a single variable
x
1
is relevant to the output. Suppose
further that this feature simply encodes the output directly, i.e. that
y
=
x
1
in all
cases. Nearest neighbor regression will not be able to detect this simple pattern.
The nearest neighbor of most points
x
will be determined by the large number of
features
x
2
through
x
100
, not by the lone feature
x
1
. Thus the output on small
training sets will essentially be random.
144
CHAPTER 5. MACHINE LEARNING BASICS
0
1
01
111
0 1
011
11111110
110
10
010
00
1110 1111
110
10
01
00
010 011
11
111
11
Figure 5.7: Diagrams describing how a decision tree works. (Top)Each node of the tree
chooses to send the input example to the child node on the left (0) or or the child node on
the right (1). Internal nodes are drawn as circles and leaf nodes as squares. Each node is
displayed with a binary string identifier corresponding to its position in the tree, obtained
by appending a bit to its parent identifier (0=choose left or top, 1=choose right or bottom).
(Bottom)The tree divides space into regions. The 2D plane shows how a decision tree
might divide
R
2
. The nodes of the tree are plotted in this plane, with each internal node
drawn along the dividing line it uses to categorize examples, and leaf nodes drawn in the
center of the region of examples they receive. The result is a piecewise-constant function,
with one piece per leaf. Each leaf requires at least one training example to define, so it is
not possible for the decision tree to learn a function that has more local maxima than the
number of training examples.
145
CHAPTER 5. MACHINE LEARNING BASICS
Another type of learning algorithm that also breaks the input space into regions
and has separate parameters for each region is the
decision tree
(Breiman et al.,
1984) and its many variants. As shown in figure 5.7, each node of the decision
tree is associated with a region in the input space, and internal nodes break that
region into one sub-region for each child of the node (typically using an axis-aligned
cut). Space is thus sub-divided into non-overlapping regions, with a one-to-one
correspondence between leaf nodes and input regions. Each leaf node usually maps
every point in its input region to the same output. Decision trees are usually
trained with specialized algorithms that are beyond the scope of this book. The
learning algorithm can be considered non-parametric if it is allowed to learn a tree
of arbitrary size, though decision trees are usually regularized with size constraints
that turn them into parametric models in practice. Decision trees as they are
typically used, with axis-aligned splits and constant outputs within each node,
struggle to solve some problems that are easy even for logistic regression. For
example, if we have a two-class problem and the positive class occurs wherever
x
2
> x
1
, the decision boundary is not axis-aligned. The decision tree will thus
need to approximate the decision boundary with many nodes, implementing a step
function that constantly walks back and forth across the true decision function
with axis-aligned steps.
As we have seen, nearest neighbor predictors and decision trees have many
limitations. Nonetheless, they are useful learning algorithms when computational
resources are constrained. We can also build intuition for more sophisticated
learning algorithms by thinking about the similarities and differences between
sophisticated algorithms and k-NN or decision tree baselines.
See Murphy (2012), Bishop (2006), Hastie et al. (2001) or other machine
learning textbooks for more material on traditional supervised learning algorithms.
5.8 Unsupervised Learning Algorithms
Recall from section 5.1.3 that unsupervised algorithms are those that experience
only “features” but not a supervision signal. The distinction between supervised
and unsupervised algorithms is not formally and rigidly defined because there is no
objective test for distinguishing whether a value is a feature or a target provided by
a supervisor. Informally, unsupervised learning refers to most attempts to extract
information from a distribution that do not require human labor to annotate
examples. The term is usually associated with density estimation, learning to
draw samples from a distribution, learning to denoise data from some distribution,
finding a manifold that the data lies near, or clustering the data into groups of
146
CHAPTER 5. MACHINE LEARNING BASICS
related examples.
A classic unsupervised learning task is to find the “best” representation of the
data. By ‘best’ we can mean different things, but generally speaking we are looking
for a representation that preserves as much information about
x
as possible while
obeying some penalty or constraint aimed at keeping the representation simpler or
more accessible than x itself.
There are multiple ways of defining a simpler representation. Three of the
most common include lower dimensional representations, sparse representations
and independent representations. Low-dimensional representations attempt to
compress as much information about
x
as possible in a smaller representation.
Sparse representations (Barlow, 1989; Olshausen and Field, 1996; Hinton and
Ghahramani, 1997) embed the dataset into a representation whose entries are
mostly zeroes for most inputs. The use of sparse representations typically requires
increasing the dimensionality of the representation, so that the representation
becoming mostly zeroes does not discard too much information. This results in an
overall structure of the representation that tends to distribute data along the axes
of the representation space. Independent representations attempt to disentangle
the sources of variation underlying the data distribution such that the dimensions
of the representation are statistically independent.
Of course these three criteria are certainly not mutually exclusive. Low-
dimensional representations often yield elements that have fewer or weaker de-
pendencies than the original high-dimensional data. This is because one way to
reduce the size of a representation is to find and remove redundancies. Identifying
and removing more redundancy allows the dimensionality reduction algorithm to
achieve more compression while discarding less information.
The notion of representation is one of the central themes of deep learning and
therefore one of the central themes in this book. In this section, we develop some
simple examples of representation learning algorithms. Together, these example
algorithms show how to operationalize all three of the criteria above. Most of the
remaining chapters introduce additional representation learning algorithms that
develop these criteria in different ways or introduce other criteria.
5.8.1 Principal Components Analysis
In section 2.12, we saw that the principal components analysis algorithm provides
a means of compressing data. We can also view PCA as an unsupervised learning
algorithm that learns a representation of data. This representation is based on
two of the criteria for a simple representation described above. PCA learns a
147
CHAPTER 5. MACHINE LEARNING BASICS
20 10 0 10 20
x
1
20
10
0
10
20
x
2
20 10 0 10 20
z
1
20
10
0
10
20
z
2
Figure 5.8: PCA learns a linear projection that aligns the direction of greatest variance
with the axes of the new space. (Left)The original data consists of samples of
x
. In this
space, the variance might occur along directions that are not axis-aligned. (Right)The
transformed data
z
=
x
W
now varies most along the axis
z
1
. The direction of second
most variance is now along z
2
.
representation that has lower dimensionality than the original input. It also learns
a representation whose elements have no linear correlation with each other. This
is a first step toward the criterion of learning representations whose elements are
statistically independent. To achieve full independence, a representation learning
algorithm must also remove the nonlinear relationships between variables.
PCA learns an orthogonal, linear transformation of the data that projects an
input
x
to a representation
z
as shown in figure 5.8. In section 2.12, we saw that
we could learn a one-dimensional representation that best reconstructs the original
data (in the sense of mean squared error) and that this representation actually
corresponds to the first principal component of the data. Thus we can use PCA
as a simple and effective dimensionality reduction method that preserves as much
of the information in the data as possible (again, as measured by least-squares
reconstruction error). In the following, we will study how the PCA representation
decorrelates the original data representation X.
Let us consider the
m × n
-dimensional design matrix
X
. We will assume that
the data has a mean of zero,
E
[
x
] =
0
. If this is not the case, the data can easily
be centered by subtracting the mean from all examples in a preprocessing step.
The unbiased sample covariance matrix associated with X is given by:
Var[x] =
1
m 1
X
X. (5.85)
148
CHAPTER 5. MACHINE LEARNING BASICS
PCA finds a representation (through linear transformation)
z
=
x
W
where
Var[z] is diagonal.
In section 2.12, we saw that the principal components of a design matrix
X
are given by the eigenvectors of X
X. From this view,
X
X = W ΛW
. (5.86)
In this section, we exploit an alternative derivation of the principal components. The
principal components may also be obtained via the singular value decomposition.
Specifically, they are the right singular vectors of
X
. To see this, let
W
be the
right singular vectors in the decomposition
X
=
UΣW
. We then recover the
original eigenvector equation with W as the eigenvector basis:
X
X =
UΣW
UΣW
= W Σ
2
W
. (5.87)
The SVD is helpful to show that PCA results in a diagonal
Var
[
z
]. Using the
SVD of X, we can express the variance of X as:
Var[x] =
1
m 1
X
X (5.88)
=
1
m 1
(UΣW
)
UΣW
(5.89)
=
1
m 1
W Σ
U
UΣW
(5.90)
=
1
m 1
W Σ
2
W
, (5.91)
where we use the fact that
U
U
=
I
because the
U
matrix of the singular value
decomposition is defined to be orthogonal. This shows that if we take
z
=
x
W
,
we can ensure that the covariance of z is diagonal as required:
Var[z] =
1
m 1
Z
Z (5.92)
=
1
m 1
W
X
XW (5.93)
=
1
m 1
W
W Σ
2
W
W (5.94)
=
1
m 1
Σ
2
, (5.95)
where this time we use the fact that
W
W
=
I
, again from the definition of the
SVD.
149
CHAPTER 5. MACHINE LEARNING BASICS
The above analysis shows that when we project the data
x
to
z
, via the linear
transformation
W
, the resulting representation has a diagonal covariance matrix
(as given by
Σ
2
) which immediately implies that the individual elements of
z
are
mutually uncorrelated.
This ability of PCA to transform data into a representation where the elements
are mutually uncorrelated is a very important property of PCA. It is a simple
example of a representation that attempts to disentangle the unknown factors of
variation underlying the data. In the case of PCA, this disentangling takes the
form of finding a rotation of the input space (described by
W
) that aligns the
principal axes of variance with the basis of the new representation space associated
with z.
While correlation is an important category of dependency between elements of
the data, we are also interested in learning representations that disentangle more
complicated forms of feature dependencies. For this, we will need more than what
can be done with a simple linear transformation.
5.8.2 k-means Clustering
Another example of a simple representation learning algorithm is
k
-means clustering.
The
k
-means clustering algorithm divides the training set into
k
different clusters
of examples that are near each other. We can thus think of the algorithm as
providing a
k
-dimensional one-hot code vector
h
representing an input
x
. If
x
belongs to cluster
i
, then
h
i
= 1 and all other entries of the representation
h
are
zero.
The one-hot code provided by
k
-means clustering is an example of a sparse
representation, because the majority of its entries are zero for every input. Later,
we will develop other algorithms that learn more flexible sparse representations,
where more than one entry can be non-zero for each input
x
. One-hot codes
are an extreme example of sparse representations that lose many of the benefits
of a distributed representation. The one-hot code still confers some statistical
advantages (it naturally conveys the idea that all examples in the same cluster are
similar to each other) and it confers the computational advantage that the entire
representation may be captured by a single integer.
The
k
-means algorithm works by initializing
k
different centroids
{µ
(1)
, . . . , µ
(k)
}
to different values, then alternating between two different steps until convergence.
In one step, each training example is assigned to cluster
i
, where
i
is the index of
the nearest centroid
µ
(i)
. In the other step, each centroid
µ
(i)
is updated to the
mean of all training examples x
(j)
assigned to cluster i.
150
CHAPTER 5. MACHINE LEARNING BASICS
One difficulty pertaining to clustering is that the clustering problem is inherently
ill-posed, in the sense that there is no single criterion that measures how well a
clustering of the data corresponds to the real world. We can measure properties of
the clustering such as the average Euclidean distance from a cluster centroid to the
members of the cluster. This allows us to tell how well we are able to reconstruct
the training data from the cluster assignments. We do not know how well the
cluster assignments correspond to properties of the real world. Moreover, there
may be many different clusterings that all correspond well to some property of
the real world. We may hope to find a clustering that relates to one feature but
obtain a different, equally valid clustering that is not relevant to our task. For
example, suppose that we run two clustering algorithms on a dataset consisting of
images of red trucks, images of red cars, images of gray trucks, and images of gray
cars. If we ask each clustering algorithm to find two clusters, one algorithm may
find a cluster of cars and a cluster of trucks, while another may find a cluster of
red vehicles and a cluster of gray vehicles. Suppose we also run a third clustering
algorithm, which is allowed to determine the number of clusters. This may assign
the examples to four clusters, red cars, red trucks, gray cars, and gray trucks. This
new clustering now at least captures information about both attributes, but it has
lost information about similarity. Red cars are in a different cluster from gray
cars, just as they are in a different cluster from gray trucks. The output of the
clustering algorithm does not tell us that red cars are more similar to gray cars
than they are to gray trucks. They are different from both things, and that is all
we know.
These issues illustrate some of the reasons that we may prefer a distributed
representation to a one-hot representation. A distributed representation could have
two attributes for each vehicle—one representing its color and one representing
whether it is a car or a truck. It is still not entirely clear what the optimal
distributed representation is (how can the learning algorithm know whether the
two attributes we are interested in are color and car-versus-truck rather than
manufacturer and age?) but having many attributes reduces the burden on the
algorithm to guess which single attribute we care about, and allows us to measure
similarity between objects in a fine-grained way by comparing many attributes
instead of just testing whether one attribute matches.
5.9 Stochastic Gradient Descent
Nearly all of deep learning is powered by one very important algorithm:
stochastic
gradient descent
or SGD. Stochastic gradient descent is an extension of the
151
CHAPTER 5. MACHINE LEARNING BASICS
gradient descent algorithm introduced in section 4.3.
A recurring problem in machine learning is that large training sets are necessary
for good generalization, but large training sets are also more computationally
expensive.
The cost function used by a machine learning algorithm often decomposes as a
sum over training examples of some per-example loss function. For example, the
negative conditional log-likelihood of the training data can be written as
J(θ) = E
x,yˆp
data
L(x, y, θ) =
1
m
m
i=1
L(x
(i)
, y
(i)
, θ) (5.96)
where L is the per-example loss L(x, y, θ) = log p(y | x; θ).
For these additive cost functions, gradient descent requires computing
θ
J(θ) =
1
m
m
i=1
θ
L(x
(i)
, y
(i)
, θ). (5.97)
The computational cost of this operation is
O
(
m
). As the training set size grows to
billions of examples, the time to take a single gradient step becomes prohibitively
long.
The insight of stochastic gradient descent is that the gradient is an expectation.
The expectation may be approximately estimated using a small set of samples.
Specifically, on each step of the algorithm, we can sample a
minibatch
of examples
B
=
{x
(1)
, . . . , x
(m
)
}
drawn uniformly from the training set. The minibatch size
m
is typically chosen to be a relatively small number of examples, ranging from
1 to a few hundred. Crucially,
m
is usually held fixed as the training set size
m
grows. We may fit a training set with billions of examples using updates computed
on only a hundred examples.
The estimate of the gradient is formed as
g =
1
m
θ
m
i=1
L(x
(i)
, y
(i)
, θ). (5.98)
using examples from the minibatch B. The stochastic gradient descent algorithm
then follows the estimated gradient downhill:
θ θ g, (5.99)
where is the learning rate.
152
CHAPTER 5. MACHINE LEARNING BASICS
Gradient descent in general has often been regarded as slow or unreliable. In
the past, the application of gradient descent to non-convex optimization problems
was regarded as foolhardy or unprincipled. Today, we know that the machine
learning models described in part II work very well when trained with gradient
descent. The optimization algorithm may not be guaranteed to arrive at even a
local minimum in a reasonable amount of time, but it often finds a very low value
of the cost function quickly enough to be useful.
Stochastic gradient descent has many important uses outside the context of
deep learning. It is the main way to train large linear models on very large
datasets. For a fixed model size, the cost per SGD update does not depend on the
training set size
m
. In practice, we often use a larger model as the training set size
increases, but we are not forced to do so. The number of updates required to reach
convergence usually increases with training set size. However, as
m
approaches
infinity, the model will eventually converge to its best possible test error before
SGD has sampled every example in the training set. Increasing
m
further will not
extend the amount of training time needed to reach the model’s best possible test
error. From this point of view, one can argue that the asymptotic cost of training
a model with SGD is O(1) as a function of m.
Prior to the advent of deep learning, the main way to learn nonlinear models
was to use the kernel trick in combination with a linear model. Many kernel learning
algorithms require constructing an
m×m
matrix
G
i,j
=
k
(
x
(i)
, x
(j)
). Constructing
this matrix has computational cost
O
(
m
2
), which is clearly undesirable for datasets
with billions of examples. In academia, starting in 2006, deep learning was
initially interesting because it was able to generalize to new examples better
than competing algorithms when trained on medium-sized datasets with tens of
thousands of examples. Soon after, deep learning garnered additional interest in
industry, because it provided a scalable way of training nonlinear models on large
datasets.
Stochastic gradient descent and many enhancements to it are described further
in chapter 8.
5.10 Building a Machine Learning Algorithm
Nearly all deep learning algorithms can be described as particular instances of
a fairly simple recipe: combine a specification of a dataset, a cost function, an
optimization procedure and a model.
For example, the linear regression algorithm combines a dataset consisting of
153
CHAPTER 5. MACHINE LEARNING BASICS
X and y, the cost function
J(w, b) = E
x,yˆp
data
log p
model
(y | x), (5.100)
the model specification
p
model
(
y | x
) =
N
(
y
;
x
w
+
b,
1), and, in most cases, the
optimization algorithm defined by solving for where the gradient of the cost is zero
using the normal equations.
By realizing that we can replace any of these components mostly independently
from the others, we can obtain a very wide variety of algorithms.
The cost function typically includes at least one term that causes the learning
process to perform statistical estimation. The most common cost function is the
negative log-likelihood, so that minimizing the cost function causes maximum
likelihood estimation.
The cost function may also include additional terms, such as regularization
terms. For example, we can add weight decay to the linear regression cost function
to obtain
J(w, b) = λ||w||
2
2
E
x,yˆp
data
log p
model
(y | x). (5.101)
This still allows closed-form optimization.
If we change the model to be nonlinear, then most cost functions can no longer
be optimized in closed form. This requires us to choose an iterative numerical
optimization procedure, such as gradient descent.
The recipe for constructing a learning algorithm by combining models, costs, and
optimization algorithms supports both supervised and unsupervised learning. The
linear regression example shows how to support supervised learning. Unsupervised
learning can be supported by defining a dataset that contains only
X
and providing
an appropriate unsupervised cost and model. For example, we can obtain the first
PCA vector by specifying that our loss function is
J(w) = E
xˆp
data
||x r(x; w)||
2
2
(5.102)
while our model is defined to have
w
with norm one and reconstruction function
r(x) = w
xw.
In some cases, the cost function may be a function that we cannot actually
evaluate, for computational reasons. In these cases, we can still approximately
minimize it using iterative numerical optimization so long as we have some way of
approximating its gradients.
Most machine learning algorithms make use of this recipe, though it may not
immediately be obvious. If a machine learning algorithm seems especially unique or
154
CHAPTER 5. MACHINE LEARNING BASICS
hand-designed, it can usually be understood as using a special-case optimizer. Some
models such as decision trees or
k
-means require special-case optimizers because
their cost functions have flat regions that make them inappropriate for minimization
by gradient-based optimizers. Recognizing that most machine learning algorithms
can be described using this recipe helps to see the different algorithms as part of a
taxonomy of methods for doing related tasks that work for similar reasons, rather
than as a long list of algorithms that each have separate justifications.
5.11 Challenges Motivating Deep Learning
The simple machine learning algorithms described in this chapter work very well on
a wide variety of important problems. However, they have not succeeded in solving
the central problems in AI, such as recognizing speech or recognizing objects.
The development of deep learning was motivated in part by the failure of
traditional algorithms to generalize well on such AI tasks.
This section is about how the challenge of generalizing to new examples becomes
exponentially more difficult when working with high-dimensional data, and how
the mechanisms used to achieve generalization in traditional machine learning
are insufficient to learn complicated functions in high-dimensional spaces. Such
spaces also often impose high computational costs. Deep learning was designed to
overcome these and other obstacles.
5.11.1 The Curse of Dimensionality
Many machine learning problems become exceedingly difficult when the number
of dimensions in the data is high. This phenomenon is known as the
curse of
dimensionality
. Of particular concern is that the number of possible distinct
configurations of a set of variables increases exponentially as the number of variables
increases.
155
CHAPTER 5. MACHINE LEARNING BASICS
Figure 5.9: As the number of relevant dimensions of the data increases (from left to
right), the number of configurations of interest may grow exponentially. (Left)In this
one-dimensional example, we have one variable for which we only care to distinguish 10
regions of interest. With enough examples falling within each of these regions (each region
corresponds to a cell in the illustration), learning algorithms can easily generalize correctly.
A straightforward way to generalize is to estimate the value of the target function within
each region (and possibly interpolate between neighboring regions). (Center)With 2
dimensions it is more difficult to distinguish 10 different values of each variable. We need
to keep track of up to 10
×
10=100 regions, and we need at least that many examples to
cover all those regions. (Right)With 3 dimensions this grows to 10
3
= 1000 regions and at
least that many examples. For
d
dimensions and
v
values to be distinguished along each
axis, we seem to need
O
(
v
d
) regions and examples. This is an instance of the curse of
dimensionality. Figure graciously provided by Nicolas Chapados.
The curse of dimensionality arises in many places in computer science, and
especially so in machine learning.
One challenge posed by the curse of dimensionality is a statistical challenge.
As illustrated in figure 5.9, a statistical challenge arises because the number of
possible configurations of
x
is much larger than the number of training examples.
To understand the issue, let us consider that the input space is organized into a
grid, like in the figure. We can describe low-dimensional space with a low number
of grid cells that are mostly occupied by the data. When generalizing to a new data
point, we can usually tell what to do simply by inspecting the training examples
that lie in the same cell as the new input. For example, if estimating the probability
density at some point
x
, we can just return the number of training examples in
the same unit volume cell as
x
, divided by the total number of training examples.
If we wish to classify an example, we can return the most common class of training
examples in the same cell. If we are doing regression we can average the target
values observed over the examples in that cell. But what about the cells for which
we have seen no example? Because in high-dimensional spaces the number of
configurations is huge, much larger than our number of examples, a typical grid cell
has no training example associated with it. How could we possibly say something
156
CHAPTER 5. MACHINE LEARNING BASICS
meaningful about these new configurations? Many traditional machine learning
algorithms simply assume that the output at a new point should be approximately
the same as the output at the nearest training point.
5.11.2 Local Constancy and Smoothness Regularization
In order to generalize well, machine learning algorithms need to be guided by prior
beliefs about what kind of function they should learn. Previously, we have seen
these priors incorporated as explicit beliefs in the form of probability distributions
over parameters of the model. More informally, we may also discuss prior beliefs as
directly influencing the function itself and only indirectly acting on the parameters
via their effect on the function. Additionally, we informally discuss prior beliefs as
being expressed implicitly, by choosing algorithms that are biased toward choosing
some class of functions over another, even though these biases may not be expressed
(or even possible to express) in terms of a probability distribution representing our
degree of belief in various functions.
Among the most widely used of these implicit “priors” is the
smoothness
prior
or
local constancy prior
. This prior states that the function we learn
should not change very much within a small region.
Many simpler algorithms rely exclusively on this prior to generalize well, and
as a result they fail to scale to the statistical challenges involved in solving AI-
level tasks. Throughout this book, we will describe how deep learning introduces
additional (explicit and implicit) priors in order to reduce the generalization
error on sophisticated tasks. Here, we explain why the smoothness prior alone is
insufficient for these tasks.
There are many different ways to implicitly or explicitly express a prior belief
that the learned function should be smooth or locally constant. All of these different
methods are designed to encourage the learning process to learn a function
f
that
satisfies the condition
f
(x) f
(x + ) (5.103)
for most configurations
x
and small change
. In other words, if we know a good
answer for an input
x
(for example, if
x
is a labeled training example) then that
answer is probably good in the neighborhood of
x
. If we have several good answers
in some neighborhood we would combine them (by some form of averaging or
interpolation) to produce an answer that agrees with as many of them as much as
possible.
An extreme example of the local constancy approach is the
k
-nearest neighbors
family of learning algorithms. These predictors are literally constant over each
157
CHAPTER 5. MACHINE LEARNING BASICS
region containing all the points
x
that have the same set of
k
nearest neighbors in
the training set. For
k
= 1, the number of distinguishable regions cannot be more
than the number of training examples.
While the
k
-nearest neighbors algorithm copies the output from nearby training
examples, most kernel machines interpolate between training set outputs associated
with nearby training examples. An important class of kernels is the family of
local
kernels
where
k
(
u, v
) is large when
u
=
v
and decreases as
u
and
v
grow farther
apart from each other. A local kernel can be thought of as a similarity function
that performs template matching, by measuring how closely a test example
x
resembles each training example
x
(i)
. Much of the modern motivation for deep
learning is derived from studying the limitations of local template matching and
how deep models are able to succeed in cases where local template matching fails
(Bengio et al., 2006b).
Decision trees also suffer from the limitations of exclusively smoothness-based
learning because they break the input space into as many regions as there are
leaves and use a separate parameter (or sometimes many parameters for extensions
of decision trees) in each region. If the target function requires a tree with at
least
n
leaves to be represented accurately, then at least
n
training examples are
required to fit the tree. A multiple of
n
is needed to achieve some level of statistical
confidence in the predicted output.
In general, to distinguish
O
(
k
) regions in input space, all of these methods
require
O
(
k
) examples. Typically there are
O
(
k
) parameters, with
O
(1) parameters
associated with each of the
O
(
k
) regions. The case of a nearest neighbor scenario,
where each training example can be used to define at most one region, is illustrated
in figure 5.10.
Is there a way to represent a complex function that has many more regions
to be distinguished than the number of training examples? Clearly, assuming
only smoothness of the underlying function will not allow a learner to do that.
For example, imagine that the target function is a kind of checkerboard. A
checkerboard contains many variations but there is a simple structure to them.
Imagine what happens when the number of training examples is substantially
smaller than the number of black and white squares on the checkerboard. Based
on only local generalization and the smoothness or local constancy prior, we would
be guaranteed to correctly guess the color of a new point if it lies within the same
checkerboard square as a training example. There is no guarantee that the learner
could correctly extend the checkerboard pattern to points lying in squares that do
not contain training examples. With this prior alone, the only information that an
example tells us is the color of its square, and the only way to get the colors of the
158
CHAPTER 5. MACHINE LEARNING BASICS
Figure 5.10: Illustration of how the nearest neighbor algorithm breaks up the input space
into regions. An example (represented here by a circle) within each region defines the
region boundary (represented here by the lines). The
y
value associated with each example
defines what the output should be for all points within the corresponding region. The
regions defined by nearest neighbor matching form a geometric pattern called a Voronoi
diagram. The number of these contiguous regions cannot grow faster than the number
of training examples. While this figure illustrates the behavior of the nearest neighbor
algorithm specifically, other machine learning algorithms that rely exclusively on the
local smoothness prior for generalization exhibit similar behaviors: each training example
only informs the learner about how to generalize in some neighborhood immediately
surrounding that example.
159
CHAPTER 5. MACHINE LEARNING BASICS
entire checkerboard right is to cover each of its cells with at least one example.
The smoothness assumption and the associated non-parametric learning algo-
rithms work extremely well so long as there are enough examples for the learning
algorithm to observe high points on most peaks and low points on most valleys
of the true underlying function to be learned. This is generally true when the
function to be learned is smooth enough and varies in few enough dimensions.
In high dimensions, even a very smooth function can change smoothly but in a
different way along each dimension. If the function additionally behaves differently
in different regions, it can become extremely complicated to describe with a set of
training examples. If the function is complicated (we want to distinguish a huge
number of regions compared to the number of examples), is there any hope to
generalize well?
The answer to both of these questions—whether it is possible to represent
a complicated function efficiently, and whether it is possible for the estimated
function to generalize well to new inputs—is yes. The key insight is that a very
large number of regions, e.g.,
O
(2
k
), can be defined with
O
(
k
) examples, so long
as we introduce some dependencies between the regions via additional assumptions
about the underlying data generating distribution. In this way, we can actually
generalize non-locally (Bengio and Monperrus, 2005; Bengio et al., 2006c). Many
different deep learning algorithms provide implicit or explicit assumptions that are
reasonable for a broad range of AI tasks in order to capture these advantages.
Other approaches to machine learning often make stronger, task-specific as-
sumptions. For example, we could easily solve the checkerboard task by providing
the assumption that the target function is periodic. Usually we do not include such
strong, task-specific assumptions into neural networks so that they can generalize
to a much wider variety of structures. AI tasks have structure that is much too
complex to be limited to simple, manually specified properties such as periodicity,
so we want learning algorithms that embody more general-purpose assumptions.
The core idea in deep learning is that we assume that the data was generated by
the composition of factors or features, potentially at multiple levels in a hierarchy.
Many other similarly generic assumptions can further improve deep learning al-
gorithms. These apparently mild assumptions allow an exponential gain in the
relationship between the number of examples and the number of regions that can
be distinguished. These exponential gains are described more precisely in sections
6.4.1, 15.4 and 15.5. The exponential advantages conferred by the use of deep,
distributed representations counter the exponential challenges posed by the curse
of dimensionality.
160
CHAPTER 5. MACHINE LEARNING BASICS
5.11.3 Manifold Learning
An important concept underlying many ideas in machine learning is that of a
manifold.
A
manifold
is a connected region. Mathematically, it is a set of points,
associated with a neighborhood around each point. From any given point, the
manifold locally appears to be a Euclidean space. In everyday life, we experience
the surface of the world as a 2-D plane, but it is in fact a spherical manifold in
3-D space.
The definition of a neighborhood surrounding each point implies the existence
of transformations that can be applied to move on the manifold from one position
to a neighboring one. In the example of the world’s surface as a manifold, one can
walk north, south, east, or west.
Although there is a formal mathematical meaning to the term “manifold,” in
machine learning it tends to be used more loosely to designate a connected set
of points that can be approximated well by considering only a small number of
degrees of freedom, or dimensions, embedded in a higher-dimensional space. Each
dimension corresponds to a local direction of variation. See figure 5.11 for an
example of training data lying near a one-dimensional manifold embedded in two-
dimensional space. In the context of machine learning, we allow the dimensionality
of the manifold to vary from one point to another. This often happens when a
manifold intersects itself. For example, a figure eight is a manifold that has a single
dimension in most places but two dimensions at the intersection at the center.
0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0
1.0
0.5
0.0
0.5
1.0
1.5
2.0
2.5
Figure 5.11: Data sampled from a distribution in a two-dimensional space that is actually
concentrated near a one-dimensional manifold, like a twisted string. The solid line indicates
the underlying manifold that the learner should infer.
161
CHAPTER 5. MACHINE LEARNING BASICS
Many machine learning problems seem hopeless if we expect the machine
learning algorithm to learn functions with interesting variations across all of
R
n
.
Manifold learning
algorithms surmount this obstacle by assuming that most
of
R
n
consists of invalid inputs, and that interesting inputs occur only along
a collection of manifolds containing a small subset of points, with interesting
variations in the output of the learned function occurring only along directions
that lie on the manifold, or with interesting variations happening only when we
move from one manifold to another. Manifold learning was introduced in the case
of continuous-valued data and the unsupervised learning setting, although this
probability concentration idea can be generalized to both discrete data and the
supervised learning setting: the key assumption remains that probability mass is
highly concentrated.
The assumption that the data lies along a low-dimensional manifold may not
always be correct or useful. We argue that in the context of AI tasks, such as
those that involve processing images, sounds, or text, the manifold assumption is
at least approximately correct. The evidence in favor of this assumption consists
of two categories of observations.
The first observation in favor of the manifold hypothesis is that the proba-
bility distribution over images, text strings, and sounds that occur in real life is
highly concentrated. Uniform noise essentially never resembles structured inputs
from these domains. Figure 5.12 shows how, instead, uniformly sampled points
look like the patterns of static that appear on analog television sets when no signal
is available. Similarly, if you generate a document by picking letters uniformly at
random, what is the probability that you will get a meaningful English-language
text? Almost zero, again, because most of the long sequences of letters do not
correspond to a natural language sequence: the distribution of natural language
sequences occupies a very small volume in the total space of sequences of letters.
162
CHAPTER 5. MACHINE LEARNING BASICS
Figure 5.12: Sampling images uniformly at random (by randomly picking each pixel
according to a uniform distribution) gives rise to noisy images. Although there is a non-
zero probability to generate an image of a face or any other object frequently encountered
in AI applications, we never actually observe this happening in practice. This suggests
that the images encountered in AI applications occupy a negligible proportion of the
volume of image space.
Of course, concentrated probability distributions are not sufficient to show
that the data lies on a reasonably small number of manifolds. We must also
establish that the examples we encounter are connected to each other by other
163
CHAPTER 5. MACHINE LEARNING BASICS
examples, with each example surrounded by other highly similar examples that
may be reached by applying transformations to traverse the manifold. The second
argument in favor of the manifold hypothesis is that we can also imagine such
neighborhoods and transformations, at least informally. In the case of images, we
can certainly think of many possible transformations that allow us to trace out a
manifold in image space: we can gradually dim or brighten the lights, gradually
move or rotate objects in the image, gradually alter the colors on the surfaces of
objects, etc. It remains likely that there are multiple manifolds involved in most
applications. For example, the manifold of images of human faces may not be
connected to the manifold of images of cat faces.
These thought experiments supporting the manifold hypotheses convey some in-
tuitive reasons supporting it. More rigorous experiments (Cayton, 2005; Narayanan
and Mitter, 2010; Schölkopf et al., 1998; Roweis and Saul, 2000; Tenenbaum et al.,
2000; Brand, 2003; Belkin and Niyogi, 2003; Donoho and Grimes, 2003; Weinberger
and Saul, 2004) clearly support the hypothesis for a large class of datasets of
interest in AI.
When the data lies on a low-dimensional manifold, it can be most natural
for machine learning algorithms to represent the data in terms of coordinates on
the manifold, rather than in terms of coordinates in
R
n
. In everyday life, we can
think of roads as 1-D manifolds embedded in 3-D space. We give directions to
specific addresses in terms of address numbers along these 1-D roads, not in terms
of coordinates in 3-D space. Extracting these manifold coordinates is challenging,
but holds the promise to improve many machine learning algorithms. This general
principle is applied in many contexts. Figure 5.13 shows the manifold structure of
a dataset consisting of faces. By the end of this book, we will have developed the
methods necessary to learn such a manifold structure. In figure 20.6, we will see
how a machine learning algorithm can successfully accomplish this goal.
This concludes part I, which has provided the basic concepts in mathematics
and machine learning which are employed throughout the remaining parts of the
book. You are now prepared to embark upon your study of deep learning.
164
CHAPTER 5. MACHINE LEARNING BASICS
Figure 5.13: Training examples from the QMUL Multiview Face Dataset (Gong et al., 2000)
for which the subjects were asked to move in such a way as to cover the two-dimensional
manifold corresponding to two angles of rotation. We would like learning algorithms to be
able to discover and disentangle such manifold coordinates. Figure 20.6 illustrates such a
feat.
165